The dotcom bubble may have finally burst but there can be no doubt that the Internet has forever changed the way
we communicate, do business and find information of all kinds. Scientific American has regularly covered the 
advances making this transformation possible. And during the past five years alone, many leading researchers 
and computer scientists have aired their views on the Web in our pages. 


In this collection, expert authors discuss a range of topics—from XML and hypersearching the web to filtering 
information and preserving the Internet in one vast archive. Other articles cover more recent ideas, including 
ways to make Web content more meaningful to machines and plans to create an operating system that would 
span the Internet as a whole. --the Editors 
FILTERING INFORMATION
ON THE INTERNET 


Look for the labels to decide if unknown 
software and World Wide Web sites are safe and interesting 


The Internet is often called a global village,
suggesting a huge but close-knit community
that shares common
values and experiences. The metaphor 
is misleading. Many cultures coexist on 
the Internet and at times clash. In its 
public spaces, people interact commer- 
cially and socially with strangers as well 
as with acquaintances and friends. The 
city is a more apt metaphor, with its 
suggestion of unlimited opportunities 
and myriad dangers. 

To steer clear of the most obviously 
offensive, dangerous or just boring neigh- 
borhoods, users can employ some me- 
chanical filtering techniques that identi- 
fy easily definable risks. One technique 
is to analyze the contents of on-line ma- 
terial. Thus, virus-detection software 
searches for code fragments that it 
knows are common in virus programs. 
Services such as AltaVista and Lycos can 
either highlight or exclude World Wide 
Web documents containing particular 
words. My colleagues and I have been 
at work on another filtering technique 
based on electronic labels that can be 
added to Web sites to describe digital 
works. These labels can convey charac- 
teristics that require human judgment— 
whether the Web page is funny or offen- 
sive—as well as information not readily 
apparent from the words and graphics, 
such as the Web site’s policies about the 
use or resale of personal data. 

The Massachusetts Institute of Tech- 
nology’s World Wide Web Consortium 
has developed a set of technical standards
called PICS (Platform for Internet
Content Selection) so that people can 
electronically distribute descriptions of 
digital works in a simple, computer- 
readable form. Computers can process 
these labels in the background, auto- 
matically shielding users from undesir- 
able material or directing their atten- 
tion to sites of particular interest. The 
original impetus for PICS was to allow 
parents and teachers to screen materials 
they felt were inappropriate for children 
using the Net. Rather than censoring 
what is distributed, as the Communica- 
tions Decency Act and other legislative 
initiatives have tried to do, PICS enables 
users to control what they receive. 


What’s in a Label? 


[Figure caption: FILTERING SYSTEM for the World Wide Web allows
individuals to decide for themselves what they want to see. Users
specify safety and content requirements (a), which label-processing
software (b) then consults to determine whether to block access to
certain pages (marked with a stop sign). Labels can be affixed by the
Web site’s author (c), or a rating agency can store its labels in a
separate database (d).]

PICS labels can describe any aspect
of a document or a Web site. The
first labels identified items that might
run afoul of local indecency laws. For
example, the Recreational Software Advisory
Council (RSAC) adapted its computer-game
rating system for the Internet.
Each RSACi (the “i” stands for
“Internet”) label has four numbers, in- 
dicating levels of violence, nudity, sex 
and potentially offensive language. An- 
other organization, SafeSurf, has devel- 
oped a vocabulary with nine separate 
scales. Labels can reflect other concerns 
beyond indecency, however. A privacy 
vocabulary, for example, could describe 
Web sites’ information practices, such 
as what personal information they col- 
lect and whether they resell it. Similarly, 
an intellectual-property vocabulary could 
describe the conditions under which an 
item could be viewed or reproduced [see 
“Trusted Systems,” by Mark Stefik, page 
78]. And various Web-indexing organi- 
zations could develop labels that indi- 
cate the subject categories or the relia- 
bility of information from a site. 

Labels could even help protect com- 
puters from exposure to viruses. It has 
become increasingly popular to down- 
load small fragments of computer code, 
bug fixes and even entire applications 
from Internet sites. People generally trust
that the software they download will
not introduce a virus; they could add a 
margin of safety by checking for labels 
that vouch for the software’s safety. The 
vocabulary for such labels might indi- 
cate which virus checks have been run 
on the software or the level of confidence 
in the code’s safety. 

In the physical world, labels can be 
attached to the things they describe, or 
they can be distributed separately. For 
example, the new cars in an automobile 
showroom display stickers describing 
features and prices, but potential cus- 
tomers can also consult independent 
listings such as consumer-interest mag- 
azines. Similarly, PICS labels can be at- 
tached or detached. An information pro- 
vider that wishes to offer descriptions 
of its own materials can directly embed 
labels in Web documents or send them 
along with items retrieved from the 
Web. Independent third parties can de- 
scribe materials as well. For instance, the 
Simon Wiesenthal Center, which tracks 
the activities of neo-Nazi groups, could 
publish PICS labels that identify Web 
pages containing neo-Nazi propaganda. 
These labels would be stored on a separate
server; not everyone who visits the
neo-Nazi pages would see the Wiesen- 
thal Center labels, but those who were 
interested could instruct their software 
to check automatically for the labels. 

Software can be configured not mere- 
ly to make its users aware of labels but 
to act on them directly. Several Web soft- 
ware packages, including CyberPatrol 
and Microsoft’s Internet Explorer, al- 
ready use the PICS standard to control 
users’ access to sites. Such software can 
make its decisions based on any PICS- 
compatible vocabulary. A user who 
plugs in the RSACi vocabulary can set 
the maximum acceptable levels of lan- 
guage, nudity, sex and violence. A user 
who plugs in a software-safety vocabu- 
lary can decide precisely which virus 
checks are required. 
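
The threshold check described above can be sketched in a few lines. This is only a minimal illustration, assuming a label has already been parsed into a dictionary of category levels. The category keys, the `is_allowed` helper and the choice to block unlabeled pages are hypothetical, made up for the example; they are not taken from the PICS specification or from any of the products named here.

```python
# Hypothetical sketch: compare a page's parsed label with user-set limits.
from typing import Mapping, Optional


def is_allowed(ratings: Optional[Mapping[str, int]],
               limits: Mapping[str, int],
               block_unlabeled: bool = True) -> bool:
    """Return True if every category the user cares about is within its limit."""
    if ratings is None:
        # No label was found; the user decides how cautious to be.
        return not block_unlabeled
    for category, maximum in limits.items():
        # A category missing from the label is treated as exceeding the limit.
        if category not in ratings or ratings[category] > maximum:
            return False
    return True


# Assumed key names for the four RSACi scales mentioned in the text.
limits = {"violence": 2, "nudity": 0, "sex": 0, "language": 1}
mild_page = {"violence": 1, "nudity": 0, "sex": 0, "language": 1}
violent_page = {"violence": 3, "nudity": 0, "sex": 0, "language": 0}

print(is_allowed(mild_page, limits))
print(is_allowed(violent_page, limits))
print(is_allowed(None, limits))
```

Swapping in a different vocabulary, such as a software-safety one, would change only the dictionary keys, which is the flexibility the PICS approach is after.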

In addition to blocking unwanted 
materials, label processing can assist in 
finding desirable materials. If a user ex- 
presses a preference for works of high 
literary quality, a search engine might 
be able to suggest links to items labeled 
that way. Or if the user prefers that personal
data not be collected or sold, a
Web server can offer a version of its ser- 
vice that does not depend on collecting 
personal information. 


Establishing Trust 


Not every label is trustworthy. The
creator of a virus can easily dis- 
tribute a misleading label claiming that 
the software is safe. Checking for labels 
merely converts the question of wheth- 
er to trust a piece of software to one of 
trusting the labels. One solution is to 
use cryptographic techniques that can 
determine whether a document has been 
changed since its label was created and 
to ensure that the label really is the work 
of its purported author. 
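
The two checks just mentioned, that the document is unchanged and that the label comes from its claimed author, can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the PICS mechanism: a real system would use a public-key digital signature, whereas this example substitutes a keyed hash (HMAC) with a shared secret because that is available in the Python standard library. The function names and label fields are hypothetical.

```python
# Hypothetical sketch: bind a label to a document digest and authenticate it.
import hashlib
import hmac


def make_label(document: bytes, rating: str, author_key: bytes) -> dict:
    """Create a label that records a digest of the document it describes."""
    digest = hashlib.sha256(document).hexdigest()
    payload = f"{rating}|{digest}".encode()
    # HMAC stands in for the author's digital signature (an assumption).
    tag = hmac.new(author_key, payload, hashlib.sha256).hexdigest()
    return {"rating": rating, "digest": digest, "tag": tag}


def verify_label(document: bytes, label: dict, author_key: bytes) -> bool:
    """Check that the label is authentic and the document is unchanged."""
    payload = f"{label['rating']}|{label['digest']}".encode()
    expected = hmac.new(author_key, payload, hashlib.sha256).hexdigest()
    authentic = hmac.compare_digest(expected, label["tag"])
    unchanged = hashlib.sha256(document).hexdigest() == label["digest"]
    return authentic and unchanged


key = b"secret shared with the label author"
program = b"print('hello')"
label = make_label(program, "virus-checked", key)

print(verify_label(program, label, key))                # original document
print(verify_label(program + b"# altered", label, key)) # modified document
print(verify_label(program, label, b"someone else"))    # wrong author key
```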

That solution, however, simply chang- 
es the question again, from one of trust- 
ing a label to one of trusting the label’s 
author. Alice may trust Bill’s labels if she 
has worked with him for years or if he 
runs a major software company whose 
reputation is at stake. Or she might trust 
an auditing organization of some kind 
to vouch for Bill. 
[Figure: sample PICS label, with the callouts from the original
illustration. The label's closing ratings clause appears with the
rest of the figure further on.]

    (PICS-1.1 "http://www.w3.org/PICS/vocab.html"
     labels by "paul@GoodMouseClicking.com"
     for "http://www.w3.org/PICS"
     generic
     exp "1997.04.04T08:15-0500"

    by: the author of the label.
    for: the URL for the item being labeled.
    generic: this term means that the label will apply to the entire
        directory of items available at http://www.w3.org/PICS.
    exp: the expiration date for this label is April 4, 1997.

Of course, some labels address mat- 
ters of personal taste rather than points 
of fact. Users may find themselves not 
trusting certain labels, simply because 
they disagree with the opinions behind 
them. To get around this problem, sys- 
tems such as GroupLens and Firefly rec- 
ommend books, articles, videos or mu- 
sical selections based on the ratings of 
like-minded people. People rate items 
with which they are familiar, and the 
software compares those ratings with 
opinions registered by other users. In 
making recommendations, the software 
assigns the highest priority to items ap- 
proved by people who agreed with the 
user’s evaluations of other materials. 
People need not know who agreed with 
them; they can participate anonymous- 
ly, preserving the privacy of their evalu- 
ations and reading habits. 
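
The idea of weighting other people's ratings by how closely they have agreed with the user can be sketched as below. This is a deliberately simple illustration, not the algorithm used by GroupLens or Firefly: the similarity measure, the 1-to-5 rating scale and the sample data are all assumptions chosen for the example.

```python
# Hypothetical sketch of recommendations from like-minded raters.
from typing import Dict, List

Ratings = Dict[str, Dict[str, int]]  # user -> {item: rating from 1 to 5}


def similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Agreement on co-rated items: 1.0 is identical, 0.0 is maximally apart."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 - mean_gap / 4.0  # ratings span 1..5, so the largest gap is 4


def recommend(user: str, ratings: Ratings) -> List[str]:
    """Rank items the user has not rated by similarity-weighted average score."""
    mine = ratings[user]
    totals: Dict[str, float] = {}
    weights: Dict[str, float] = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        weight = similarity(mine, theirs)
        if weight <= 0:
            continue  # ignore raters with no agreement at all
        for item, score in theirs.items():
            if item not in mine:
                totals[item] = totals.get(item, 0.0) + weight * score
                weights[item] = weights.get(item, 0.0) + weight
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted, key=predicted.get, reverse=True)


ratings = {
    "alice": {"book_a": 5, "book_b": 1},
    "bob": {"book_a": 5, "book_b": 1, "book_c": 5, "book_d": 1},
    "carol": {"book_a": 1, "book_b": 5, "book_c": 1, "book_d": 5},
}
print(recommend("alice", ratings))
```

Note that the function never reveals which rater produced a recommendation, in keeping with the anonymity described above.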

Widespread reliance on labeling raises 
a number of social concerns. The most 
obvious are the questions of who de- 
cides how to label sites and what labels 
are acceptable. Ideally, anyone could la- 
bel a site, and everyone could establish 
individual filtering rules. But there is a 
concern that authorities could assign la- 
bels to sites or dictate criteria for sites 
to label themselves. In an example from 
a different medium, the television indus- 
try, under pressure from the U.S. gov- 
ernment, has begun to rate its shows for 
age appropriateness. 

Mandatory self-labeling need not 
lead to censorship, so long as individu- 
als can decide which labels to ignore. 
But people may not always have this 
power. Improved individual control re- 
moves one rationale for central control 
but does not prevent its imposition. 
Singapore and China, for instance, are
experimenting with national “firewalls,”
combinations of software and
hardware that block their citizens’ access
to certain newsgroups and Web sites.

[Figure, continued: the sample PICS label ends with the clause
ratings (q 2 v 3) ), the actual ratings for the directory; the
violence scale (v) is set at level 3.]

Another concern is that even without 
central censorship, any widely adopted 
vocabulary will encourage people to 
make lazy decisions that do not reflect 
their values. Today many parents who 
may not agree with the criteria used to 
assign movie ratings still forbid their 
children to see movies rated PG-13 or 
R; it is too hard for them to weigh the 
merits of each movie by themselves. 

Labeling organizations must choose 
vocabularies carefully to match the cri- 
teria that most people care about, but 
even so, no single vocabulary can serve 
everyone’s needs. Labels concerned only 
with rating the level of sexual content 
at a site will be of no use to someone 
concerned about hate speech. And no 
labeling system is a full substitute for a 
thorough and thoughtful evaluation: 
movie reviews in a newspaper can be 
far more enlightening than any set of 
predefined codes. 

[Figure caption: COMPUTER CODE for a PICS standards label is typically
read by label-processing software, not humans. This sample label rates
both the literary quality and the violent content of the Web site
http://www.w3.org/PICS. Callouts: the document at the address, or URL,
on the label's first line defines the terms of the labeling vocabulary;
for instance, “q” will stand for literary quality and “v” for violence.
Literary quality is set at level 2, and violent content at level 3.]

Perhaps most troubling is the suggestion
that any labeling system, no matter
how well conceived and executed, will
tend to stifle noncommercial communication.
Labeling requires human time
and energy; many sites of limited inter- 
est will probably go unlabeled. Because 
of safety concerns, some people will 
block access to materials that are unla- 
beled or whose labels are untrusted. For 
such people, the Internet will function 
more like broadcasting, providing access 
only to sites with sufficient mass-mar- 
ket appeal to merit the cost of labeling. 

While lamentable, this problem is an 
inherent one that is not caused by label- 
ing. In any medium, people tend to 
avoid the unknown when there are 
risks involved, and it is far easier to get 
information about material that is of 
wide interest than about items that ap- 
peal to a small audience. 

Although the Net nearly eliminates 
the technical barriers to communica- 
tion with strangers, it does not remove 
the social costs. Labels can reduce those 
costs, by letting us control when we ex- 
tend trust to potentially boring or dan- 
gerous software or Web sites. The chal- 
lenge will be to let labels guide our ex- 
ploration of the global city of the 
Internet and not limit our travels. 
PRESERVING
THE INTERNET 


An archive of the Internet may prove to be a vital record for
historians, businesses and governments


Manuscripts from the library of Alexandria in ancient Egypt
disappeared in a fire. The early printed
books decayed into unrecognizable 
shreds. Many of the oldest cinematic 
films were recycled for their silver con- 
tent. Unfortunately, history may repeat 
itself in the evolution of the Internet— 
and its World Wide Web. 

No one has tried to capture a com- 
prehensive record of the text and imag- 
es contained in the documents that ap- 
pear on the Web. The history of print 
and film is a story of loss and partial re- 
construction. But this scenario need not 
be repeated for the Web, which has in- 
creasingly evolved into a storehouse of 
valuable scientific, cultural and histori- 
cal information. 

The dropping costs of digital storage 
mean that a permanent record of the 
Web and the rest of the Internet can be 
preserved by a small group of technical 
professionals equipped with a modest 
complement of computer workstations 
and data storage devices. A year ago I 
and a few others set out to realize this 
vision as part of a venture known as the 
Internet Archive. 

By the time this article is published, 
we will have taken a snapshot of all 
parts of the Web freely and technically 
accessible to us. This collection of data 
will measure perhaps as much as two 
trillion bytes (two terabytes) of data, 
ranging from text to video to audio recording.
In comparison, the Library of
Congress contains about 20 terabytes of 
text information. In the coming months, 
our computers and storage media will 
make records of other areas of the In- 
ternet, including the Gopher informa- 
tion system and the Usenet bulletin 
boards. The material gathered so far 
has already proved a useful resource to 
historians. In the future, it may provide 
the raw material for a carefully indexed, 
searchable library. 

The logistics of taking a snapshot of 
the Web are relatively simple. Our Inter- 
net Archive operates with a staff of 10 
people from offices located in a convert- 
ed military base—the Presidio—in down- 
town San Francisco; it also runs an in- 
formation-gathering computer in the 
San Diego Supercomputer Center at the 
University of California at San Diego. 

The software on our computers 
“crawls” the Net—downloading docu- 
ments, called pages, from one site after 
another. Once a page is captured, the 
software looks for cross references, or 
links, to other pages. It uses the Web’s 
hyperlinks—addresses embedded with- 
in a document page—to move to other 
pages. The software then makes copies 
again and seeks additional links con- 
tained in the new pages. The crawler 
avoids downloading duplicate copies of 
pages by checking the identification 
names, called uniform resource locators 
(URLs), against a database. Programs 
such as Digital Equipment Corporation’s 
AltaVista also employ crawler software 
for indexing Web sites. 
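
The crawl loop described above (capture a page, extract its links, skip URLs already seen) can be sketched as follows. This is a minimal illustration and not the Internet Archive's software: the `crawl` function, the injected `fetch` callable and the toy in-memory "web" are assumptions that keep the example self-contained and runnable without a network.

```python
# Hypothetical sketch of a breadth-first crawler with URL de-duplication.
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str,
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 1000) -> Dict[str, str]:
    """Copy pages reachable from start_url, visiting each URL at most once."""
    seen = {start_url}           # stands in for the URL database
    queue = deque([start_url])
    archive: Dict[str, str] = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:         # inaccessible or missing page
            continue
        archive[url] = page
        parser = LinkExtractor()
        parser.feed(page)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# A toy web of three pages; the fourth link leads nowhere.
toy_web = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/missing.html">gone</a>',
}
archive = crawl("http://example.org/", toy_web.get)
print(sorted(archive))
```

To crawl real sites, `fetch` would be replaced by a function that downloads a URL and returns its text, or `None` on failure.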
What makes this experiment possible
is the dropping cost of data storage. The 
price of a gigabyte (a billion bytes) of 
hard-disk space is $200, whereas tape 
storage using an automated mounting 
device costs $20 a gigabyte. We chose 
hard-disk storage for a small amount of 
data that users of the archive are likely 
to access frequently and a robotic de- 
vice that mounts and reads tapes auto- 
matically for less used information. A 
disk drive accesses data in an average of 
15 milliseconds, whereas tapes require 
four minutes. Frequently accessed in- 
formation might be historical docu- 
ments or a set of URLs no longer in use. 
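
A little arithmetic shows why the mix of disk and tape matters. The sketch below uses the prices quoted above and the two-terabyte estimate given earlier; the 5 percent share of data kept on disk is purely an assumed figure for illustration, since the text does not say how the archive is actually divided.

```python
# Rough storage-cost arithmetic using the per-gigabyte prices from the text.
DISK_DOLLARS_PER_GB = 200   # hard-disk price quoted in the text
TAPE_DOLLARS_PER_GB = 20    # automated tape price quoted in the text
ARCHIVE_GB = 2_000          # two terabytes, counting 1,000 gigabytes each


def storage_cost(total_gb: float, disk_fraction: float) -> float:
    """Cost in dollars when disk_fraction of the data sits on hard disk."""
    disk_gb = total_gb * disk_fraction
    tape_gb = total_gb - disk_gb
    return disk_gb * DISK_DOLLARS_PER_GB + tape_gb * TAPE_DOLLARS_PER_GB


# 0.05 is an assumed, illustrative split, not a figure from the article.
for fraction in (1.0, 0.05, 0.0):
    cost = storage_cost(ARCHIVE_GB, fraction)
    print(f"{fraction:.0%} on disk: ${cost:,.0f}")
```

Keeping everything on disk would cost ten times as much as keeping everything on tape, so reserving disk for frequently requested material captures most of the savings while keeping those items quickly accessible.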

We plan to update the information 
gathered at least every few months. The 
first full record required nearly a year 
to compile. In future passes through the 
Web, we will be able to update only the 
information that has changed since our 
last perusal. 

The text, graphics, audio clips and 
other data collected from the Web will 
never be comprehensive, because the 
crawler software cannot gain access to 
many of the hundreds of thousands of 
sites. Publishers restrict access to data 
or store documents in a format inacces- 
sible to simple crawler programs. Still, 
the archive gives a feel of what the Web 
looks like during a given period of time 
even though it does not constitute a full 
record. 

After gathering and storing the public 
contents of the Internet, what services 
will the archive provide? We possess the 
capability of supplying documents that 
are no longer available from the original
publisher, an important function if
the Web’s hypertext system is to become 
a medium for scholarly publishing. Such 
a service could also prove worthwhile 
for business research. And the archival 
data might serve as a “copy of record” 
for the government or other institutions 
with publicly available documents. So, 
over time, the archive would come to 
resemble a digital library. 


Keeping Missing Links 


Historians have already found the
material useful. David Allison of 
the Smithsonian Institution has tapped 
into the archive for a presidential elec- 
tion Web site exhibit at the museum, a 
project he compares to saving video- 
tapes of early television campaign ad- 
vertisements. Many of the links for these 
Web sites, such as those for Texas Sena- 
tor Phil Gramm’s campaign, have al- 
ready disappeared from the Internet. 

Creating an archive touches on an array
of issues, from privacy to copyright.
What if a college student created a Web 
page that had pictures of her then cur- 
rent boyfriend? What if she later want- 
ed to “tear them up,” so to speak, yet 
they lived on in the archive? Should she 
have the right to remove them? In con- 
trast, should a public figure—a U.S. sen- 
ator, for instance—be able to erase data 
posted from his or her college years? 
Does collecting information made avail- 
able to the public violate the “fair use” 
provisions of the copyright law? The is- 
sues are not easily resolved. 

To address these worries, we let authors
exclude their works from the archive.
We are also considering allowing
researchers to obtain broad censuses of 
the archive data instead of individual 
documents—one could count the total 
number of references to pachyderms on 
the Web, for instance, but not look at a 
specific elephant home page. These mea- 
sures, we hope, will suffice to allay im- 
mediate concerns about privacy and in- 
tellectual-property rights. Over time, the 
issues addressed in setting up the Inter- 
net Archive might help resolve the larg- 
er policy debates on intellectual proper- 
ty and privacy by testing concepts such 
as fair use on the Internet. 

The Internet Archive complements 
other projects intended to ensure the 
longevity of information on the Internet. 
The Commission on Preservation and 
Access in Washington, D.C., researches 
how to ensure that data are not lost as 
the standard formats for digital storage 
media change over the years. In another 
effort, the Internet Engineering Task 
Force and other groups have labored on 
technical standards that give a unique 
identification name to digital documents. 
These uniform resource names (URNs), 
as they are called, could supplement the 
URLs that currently access Web docu- 
ments. Giving a document a URN at- 
tempts to ensure that it can be traced 
after a link disappears, because estimates 
put the average lifetime for a URL at 44 
days. The URN would be able to locate 
other URLs that still provided access to 
the desired documents. 
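
The relationship between a persistent name and its changeable locations can be pictured as a lookup table. The sketch below is only an illustration of that idea: the `resolve` function, the registry contents and the example URN are hypothetical, and the actual URN standards define their own resolution machinery.

```python
# Hypothetical sketch: resolve a persistent name to a URL that still works.
from typing import Callable, Dict, List, Optional


def resolve(urn: str,
            registry: Dict[str, List[str]],
            is_live: Callable[[str], bool]) -> Optional[str]:
    """Return the first known location for the name that is still reachable."""
    for url in registry.get(urn, []):
        if is_live(url):
            return url
    return None


# Made-up registry: one name, two locations, the first of which has vanished.
registry = {
    "urn:example:report-42": [
        "http://old.example.org/report.html",
        "http://mirror.example.net/report.html",
    ],
}
live_urls = {"http://mirror.example.net/report.html"}

print(resolve("urn:example:report-42", registry, live_urls.__contains__))
```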

Other, more limited attempts to archive
parts of the Internet have also begun.
DejaNews keeps a record of messages
on the Usenet bulletin boards, and
InReference archives Internet mailing 
lists. Both support themselves with rev- 
enue from advertisers, a possible fund- 
ing source for the Internet Archive as 
well. Until now, I have funded the proj- 
ect with money I received from the sale 
of an Internet software and services 
company. Major computer companies 
have also donated equipment. 

It will take many years before an in- 
frastructure that assures Internet preser- 
vation becomes well established—and 
for questions involving intellectual-prop- 
erty issues to resolve themselves. For our 
part, we feel that it is important to pro- 
ceed with the collection of the archival 
material because it can never be recov- 
ered in the future. And the opportunity 
to capture a record of the birth of a 
new medium will then be lost. 
SEARCHING THE INTERNET


Combining the skills of the librarian and the computer scientist 
may help organize the anarchy of the Internet 


One sometimes hears the Internet characterized
as the world’s library for the digital age. This
description does not stand up under even casual examination.
The Internet, and particularly its
collection of multimedia resources
known as the World Wide Web, was
not designed to support the organized 
publication and retrieval of informa- 
tion, as libraries are. It has evolved into 
what might be thought of as a chaotic 
repository for the collective output of 
the world’s digital “printing presses.” 
This storehouse of information con- 
tains not only books and papers but 
raw scientific data, menus, meeting 
minutes, advertisements, video and audio
recordings, and transcripts of interactive
conversations. The ephemeral
mixes everywhere with works of lasting 
importance. 

In short, the Net is not a digital libra- 
ry. But if it is to continue to grow and 
thrive as a new means of communica- 
tion, something very much like tradi- 
tional library services will be needed to 
organize, access and preserve networked 
information. Even then, the Net will not 
resemble a traditional library, because 
its contents are more widely dispersed 
than a standard collection. Consequent- 
ly, the librarian’s classification and se- 
lection skills must be complemented by 
the computer scientist’s ability to auto- 
mate the task of indexing and storing 
information. Only a synthesis of the 
differing perspectives brought by both 
professions will allow this new medium 
to remain viable. 


[Figure caption, truncated in the source: SEARCH ENGINE operates by visi...
World Wide Web sites, pictured as blue ... lines represent the output
from and inp... tower at center), where Web pages are d...]
At the moment, computer technology
bears most of the responsibility for or- 
ganizing information on the Internet. In 
theory, software that automatically 
classifies and indexes collections of dig- 
ital data can address the glut of infor- 
mation on the Net—and the inability of 
human indexers and bibliographers to 
cope with it. Automating information 
access has the advantage of directly ex- 
ploiting the rapidly dropping costs of 
computers and avoiding the high ex- 
pense and delays of human indexing. 

But, as anyone who has ever sought 
information on the Web knows, these 
automated tools categorize information 
differently than people do. In one sense, 
the job performed by the various index- 
ing and cataloguing tools known as 
search engines is highly democratic. Ma- 
chine-based approaches provide uniform 
and equal access to all the information
on the Net. In practice,
this electronic egalitarianism
can prove a mixed blessing.
Web “surfers” who type
in a search request are often
overwhelmed by thousands of
responses. The search results
frequently contain references to
irrelevant Web sites while leaving
out others that hold important
material.

[Figure data: APPROXIMATE NUMBER OF WEB SITES

    JUNE 1993          130
    DEC. 1993          (illegible in source)
    JUNE 1994        2,740
    DEC. 1994       10,000
    JUNE 1995       23,500
    JAN. 1996      100,000
    JUNE 1996      230,000
    JAN. 1997      650,000

A companion bar chart showed “.com” sites as a percent of all sites
(axis from 0 to 70); its values did not survive in the source.]


Crawling the Web 
The nature of electronic indexing
can be understood
by examining the way Web
search engines, such as Lycos 
or Digital Equipment Corpora- 
tion’s AltaVista, construct in- 
dexes and find information re- 
quested by a user. Periodically, they dis- 
patch programs (sometimes referred to 
as Web crawlers, spiders or indexing ro- 
bots) to every site they can identify on 
the Web—each site being a set of docu- 
ments, called pages, that can be accessed 
over the network. The Web crawlers 
download and then examine these pag- 
es and extract indexing information that 
can be used to describe them. This pro- 
cess—details of which vary among search 
engines—may include simply locating 
most of the words that appear in Web 
pages or performing sophisticated anal- 
yses to identify key words and phrases. 
These data are then stored in the search 
engine’s database, along with an ad- 
dress, termed a uniform resource loca- 
tor (URL), that represents where the file 
resides. A user then deploys a browser, 
such as the familiar Netscape, to submit 
queries to the search engine’s database. 
The query produces a list of Web re- 
sources, the URLs that can be clicked 
on to connect to the sites identified by 
the search. 
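
The indexing step can be pictured with a small inverted index, which maps each word to the URLs of the pages that contain it. This is a bare-bones sketch under stated assumptions, not the design of Lycos or AltaVista: it indexes every word, requires all query words to match, and uses made-up pages.

```python
# Hypothetical sketch of a search engine's index: word -> set of URLs.
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Record, for every word, which pages mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the URLs of pages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


pages = {
    "http://example.org/war.html": "Civil War photographs and soldier diaries",
    "http://example.org/music.html": "Period music of the Civil War",
    "http://example.org/mac.html": "Macintosh software archive",
}
index = build_index(pages)
print(search(index, "civil war"))
print(search(index, "macintosh diaries"))
```

Nothing in such an index captures what a page is about or what kind of document it is, which is the limitation the following paragraphs take up.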

Existing search engines service mil- 
lions of queries a day. Yet it has become 
clear that they are less than ideal for re- 
trieving an ever growing body of infor- 
mation on the Web. In contrast to hu- 
man indexers, automated programs 
have difficulty identifying characteris- 
tics of a document such as its overall 
theme or its genre—whether it is a poem 
or a play, or even an advertisement. 

[Figure caption: GROWTH AND CHANGE on the Internet are reflected in
the burgeoning number of Web sites, host computers and
commercial, or “.com,” sites.]

The Web, moreover, still lacks standards
that would facilitate automated
indexing. As a result, documents on the
Web are not structured so that programs
can reliably extract the routine informa- 
tion that a human indexer might find 
through a cursory inspection: author, 
date of publication, length of text and 
subject matter. (This information is 
known as metadata.) A Web crawler 
might turn up the desired article au- 
thored by Jane Doe. But it might also 
find thousands of other articles in which 
such a common name is mentioned in 
the text or in a bibliographic reference. 

Publishers sometimes abuse the indiscriminate
character of automated index-
ing. A Web site can bias the selection 
process to attract attention to itself by 
repeating within a document a word, 
such as “sex,” that is known to be quer- 
ied often. The reason: a search engine 
will display first the URLs for the docu- 
ments that mention a search term most 
frequently. In contrast, humans can eas- 
ily see around simpleminded tricks. 
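
The weakness of ranking by raw word counts is easy to demonstrate. The sketch below assumes the simplest possible scoring rule, ordering pages by how many times the query term appears; real search engines vary in their details, and the two sample pages are invented.

```python
# Hypothetical sketch: frequency-based ranking and how repetition games it.
import re
from typing import Dict, List


def rank_by_frequency(pages: Dict[str, str], term: str) -> List[str]:
    """Order matching pages by how often the term occurs, most frequent first."""
    term = term.lower()
    counts = {
        url: re.findall(r"[a-z0-9]+", text.lower()).count(term)
        for url, text in pages.items()
    }
    matching = [url for url, n in counts.items() if n > 0]
    return sorted(matching, key=lambda url: counts[url], reverse=True)


pages = {
    "http://example.org/review.html":
        "A careful review of jazz recordings and jazz history",
    "http://example.org/stuffed.html":
        "jazz " * 50 + "buy our unrelated product",
}
print(rank_by_frequency(pages, "jazz"))
```

The page that merely repeats the word fifty times outranks the genuine review, which is exactly the bias a human indexer would spot at a glance.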

The professional indexer can describe
the components of individual pages of 
all sorts (from text to video) and can 
clarify how those parts fit together into 
a database of information. Civil War 
photographs, for example, might form 
part of a collection that also includes 
period music and soldier diaries. A hu- 
man indexer can describe a site’s rules 
for the collection and retention of pro- 
grams in, say, an archive that stores 
Macintosh software. Analyses of a site’s 
purpose, history and policies are beyond 
the capabilities of a crawler program. 

Another drawback of automated indexing
is that most search engines recognize
text only. The intense
interest in the Web, though, has 
come about because of the me- 
dium’s ability to display imag- 
es, whether graphics or video 
clips. Some research has moved 
forward toward finding colors 
or patterns within images [see 
box on next two pages]. But no 
program can deduce the un- 
derlying meaning and cultural 
significance of an image (for ex- 
ample, that a group of men din- 
ing represents the Last Supper). 

At the same time, the way 
information is structured on 
the Web is changing so that it 
often cannot be examined by 
Web crawlers. Many Web pag- 
es are no longer static files that 
can be analyzed and indexed by 
such programs. In many cases, 
the information displayed in a docu- 
ment is computed by the Web site dur- 
ing a search in response to the user’s re- 
quest. The site might assemble a map, a 
table and a text document from differ- 
ent areas of its database, a disparate 
collection of information that conforms 
to the user’s query. A newspaper’s Web 
site, for instance, might allow a reader to 
specify that only stories on the oil-equip- 
ment business be displayed in a person- 
alized version of the paper. The database 
of stories from which this document is 
put together could not be searched by a 
Web crawler that visits the site. 

A growing body of research has at- 
tempted to address some of the prob- 
lems involved with automated classifi- 
cation methods. One approach seeks to 
attach metadata to files so that index- 
ing systems can collect this information. 
The most advanced effort is the Dublin 
Core Metadata program and an affiliat- 
ed endeavor, the Warwick Framework— 
the first named after a workshop in 
Dublin, Ohio, the other for a colloquy 
in Warwick, England. The workshops 
have defined a set of metadata elements 
that are simpler than those in traditional 
library cataloguing and have also creat- 
ed methods for incorporating them 
within pages on the Web. 
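
One way such metadata can travel inside a page is as `meta` tags in the HTML header, which a crawler can read without having to interpret the body text. The sketch below assumes the common convention of prefixing Dublin Core element names with `DC.`; the sample page, its values and the `MetadataExtractor` class are illustrative inventions, not material from the workshops.

```python
# Hypothetical sketch: pull Dublin Core style metadata out of a page header.
from html.parser import HTMLParser
from typing import Dict


class MetadataExtractor(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name") or ""
        content = fields.get("content")
        # Keep only elements that use the assumed "DC." prefix.
        if name.lower().startswith("dc.") and content is not None:
            self.metadata[name[3:].lower()] = content


page = """<html><head>
<meta name="DC.title" content="Letters from the Front">
<meta name="DC.creator" content="Jane Doe">
<meta name="DC.date" content="1997-03-01">
<meta name="keywords" content="ignored by this sketch">
</head><body>Text that mentions many other names.</body></html>"""

extractor = MetadataExtractor()
extractor.feed(page)
print(extractor.metadata)
```

With the author recorded as a distinct field, a search for works by Jane Doe no longer has to match every page that happens to mention the name.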

Categorization of metadata might 
range from title or author to type of 
document (text or video, for instance). 
Either automated indexing software or 
humans may derive the metadata, which 
can then be attached to a Web page for 
retrieval by a crawler. Precise and de- 
Finding Pictures on the Web


by Gary Stix, staff writer 


The Internet came into its own a few years ago, when the
World Wide Web arrived with its dazzling array of photography,
animation, graphics, sound and video that ranged in subject
matter from high art to the patently lewd. Despite the multimedia 
barrage, finding things on the hundreds of thousands of Web sites 
still mostly requires searching indexes for words and numbers. 

Someone who types the words “French flag” into the popular 
search engine AltaVista might retrieve the requested graphic, as 
long as it were captioned by those two identifying words. But what 
if someone could visualize a blue, white and red banner but did 
not know its country of origin? 

Ideally, a search engine should allow the user to draw or scan in 
a rectangle with vertical thirds that are colored blue, white and 
red—and then find any matching images stored on myriad Web 
sites. In the past few years, techniques that combine key-word in- 
dexing with image analysis have begun to pave the way for the 
first image search engines. 

Although these prototypes suggest possibilities for the indexing 
of visual information, they also demonstrate the crudeness of ex- 
isting tools and the continuing reliance on text to track down im- 
agery. One project, called WebSEEk, based at Columbia University, 
illustrates the workings of an image search engine. WebSEEk be- 
gins by downloading files found by trolling the Web. It then at- 
tempts to locate file names containing acronyms, such as GIF or 
MPEG, that designate graphics or video content. It also looks for 
words in the names that might identify the subject of the files. 
When the software finds an image, it analyzes the prevalence of 
different colors and where they are located. Using this information, 
it can distinguish among photographs, graphics and black-and- 
white or gray images. The software also compresses each picture 
so that it can be represented as an icon, a miniature image for dis- 
play alongside other icons. For a video, it will extract key frames 
from different scenes. 

A user begins a search by selecting a category from a menu— 
“cats,” for example. WebSEEk provides a sampling of icons for the 


AUTOMATED INDEXING, used by 
Web crawler software, analyzes a page 
(left panel) by designating most words as 
indexing terms (top center) or by grouping 
words into simple phrases (bottom cen- 
ter). Human indexing (right) gives addi- 
tional context about the subject of a page. 


tailed human annotations can provide a 
more in-depth characterization of a 
page than can an automated indexing 
program alone. 

Where costs can be justified, human 
indexers have begun the laborious task 
of compiling bibliographies of some 
Web sites. The Yahoo database, a com- 
mercial venture, classifies sites by broad 
subject area. And a research project at 
the University of Michigan is one of 


“cats” category. To narrow 
the search, the user can 
click on any icons that 
show black cats. Using its 
previously generated col- 
or analysis, the search en- 
gine looks for matches of 
images that have a similar 
color profile. The presen- 
tation of the next set of 
icons may show black 
cats—but also some mar- 
malade cats sitting on 
black cushions. A visitor 
to WebSEEk can refine a 
search by adding or ex- 
cluding certain colors from an image when initiating subsequent 
queries. Leaving out yellows or oranges might get rid of the odd 
marmalade. More simply, when presented with a series of icons, 
the user can also specify those images that do not contain black 
cats in order to guide the program away from mistaken choices. So 
far WebSEEk has downloaded and indexed more than 650,000 pic- 
tures from tens of thousands of Web sites. 
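
The idea of matching images by color profile can be sketched in a few lines. The code below is only an illustration under stated assumptions: the coarse binning, the histogram-intersection score and the tiny made-up "images" are choices for this example, not a description of how WebSEEk is actually implemented.

```python
from collections import Counter


def color_profile(pixels, levels=4):
    """Return the fraction of pixels falling in each coarse RGB bin."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}


def similarity(profile_a, profile_b):
    """Histogram intersection: near 1.0 for identical profiles, 0.0 for disjoint ones."""
    return sum(min(share, profile_b.get(bin_, 0.0)) for bin_, share in profile_a.items())


# Tiny hypothetical "images", each a list of (R, G, B) pixels.
black_cat = [(10, 10, 10)] * 70 + [(200, 200, 200)] * 30
marmalade_on_black_cushion = [(15, 12, 8)] * 60 + [(230, 140, 40)] * 40
marmalade_on_grass = [(230, 140, 40)] * 50 + [(40, 160, 50)] * 50

query = color_profile(black_cat)
cushion_score = similarity(query, color_profile(marmalade_on_black_cushion))
grass_score = similarity(query, color_profile(marmalade_on_grass))

if __name__ == "__main__":
    print(cushion_score, grass_score)
```

In this toy case the marmalade cat on a black cushion scores well against the black-cat query simply because both pictures are mostly dark, which is the kind of mistaken match the text describes.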

Other image-searching projects include efforts at the University 
of Chicago, the University of California at San Diego, Carnegie Mel- 
lon University, the Massachusetts Institute of Technology's Media 
Lab and the University of California at Berkeley. A number of com- 
mercial companies, including IBM and Virage, have crafted soft- 
ware that can be used for searching corporate networks or data- 
bases. And two companies—Excalibur Technologies and Interpix 
Software—have collaborated to supply software to the Web-based 
indexing concerns Yahoo and Infoseek. 

One of the oldest image searchers, IBM's Query by Image Con- 
tent (QBIC), produces more sophisticated matching of image fea- 
tures than, say, WebSEEk can. It is able not only to pick out the col- 
several efforts to develop more formal 
descriptions of sites that contain mate- 
rial of scholarly interest. 


Not Just a Library 


The extent to which either human 

classification skills or automated 
indexing and searching strategies are 
needed will depend on the people who 
use the Internet and on the business 
prospects for publishers. For many com- 
munities of scholars, the model of an 
organized collection—a digital library— 
still remains relevant. For other groups, 
an uncontrolled, democratic medium 
may provide the best vehicle for infor- 
mation dissemination. Some users, from 
financial analysts to spies, want comprehensive 
access to raw databases of 
information, free of any controls or 
editing. For them, standard search en- 
gines provide real benefits because they 
forgo any selective filtering of data. 
The diversity of materials on the Net 
goes far beyond the scope of the tradi- 
tional library. A library does not pro- 
vide quality rankings of the works in a 
collection. Because of the greater vol- 
ume of networked information, Net us- 
ers want guidance about where to spend 
the limited amount of time they have to 
research a subject. They may need to 
know the three “best” documents for a 
given purpose. They want this informa- 
tion without paying the costs of em- 
ploying humans to critique the myriad 
Web sites. One solution that again calls 


for human involvement is to share judg- 
ments about what is worthwhile. Soft- 
ware-based rating systems have begun 
to let users describe the quality of par- 
ticular Web sites [see “Filtering Infor- 
mation on the Internet,” by Paul Res- 
nick, page 62]. 

Software tools search the Internet and 
also separate the good from the bad. 
New programs may be needed, though, 
to ease the burden of feeding the crawl- 
ers that repeatedly scan Web sites. Some 
Web site managers have reported that 
their computers are spending enormous 
amounts of time in providing crawlers 
with information to index, instead of 
servicing the people they hope to at- 
tract with their offerings. 

To address this issue, Mike Schwartz 
ors in an image but also to gauge texture by several measures: 
contrast (the black and white of zebra stripes), coarseness (stones 
versus pebbles) and directionality (linear fence posts versus omni- 
directional flower petals). QBIC also has a limited ability to search 
for shapes within an image. Specifying a pink dot on a green back- 
ground turns up flowers and other photographs with similar 
shapes and colors, as shown above. Possible applications range 
from the selection of wallpaper patterns to enabling police to 
identify gang members by clothing type. 

All these programs do nothing more than match one visual fea- 
ture with another. They still require a human observer—or accom- 
panying text—to confirm whether an object is a cat or a cushion. 
For more than a decade, the artificial-intelligence community has 
labored, with mixed success, on nudging computers to ascertain 
directly the identity of objects within an image, whether they are 
cats or national flags. This approach correlates the shapes in a pic- 
ture with geometric models of real-world objects. The program 
can then deduce that a pink or brown cylinder, say, is a human arm. 

One example is software that looks for naked people, a pro- 
gram that is the work of David A. Forsyth of Berkeley and Margaret 
M. Fleck of the University of Iowa. The software begins by analyzing 
the color and texture of a photograph. When it finds matches 
for flesh colors, it runs an algorithm that looks for cylindrical areas 
that might correspond to an arm or leg. It then seeks other flesh- 
colored cylinders, positioned at certain angles, which might con- 
firm the presence of limbs. In a test last fall, the program picked 
out 43 percent of the 565 naked people among a group of 4,854 
images, a high percentage for this type of complex image analysis. 
It registered, moreover, only a 4 percent false positive rate among 
the 4,289 images that did not contain naked bodies. The nudes 
were downloaded from the Web; the other photographs came 
primarily from commercial databases. 
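
The reported rates can be turned into approximate counts with a little arithmetic. The calculation below assumes the quoted percentages are exact, which the article does not state, so the results are estimates only.

```python
# Figures quoted in the text.
nudes = 565
others = 4289
detection_rate = 0.43        # share of the nudes that were picked out
false_positive_rate = 0.04   # share of the other images wrongly flagged

true_positives = detection_rate * nudes          # about 243 images
false_positives = false_positive_rate * others   # about 172 images
flagged = true_positives + false_positives

# Of everything the program flags, what share really shows a naked person?
precision = true_positives / flagged

if __name__ == "__main__":
    print(round(true_positives), round(false_positives), round(precision, 2))
```

On these assumptions, a little under six in ten of the flagged images would actually contain a naked person, even though the false positive rate looks small.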

The challenges of computer vision will most likely remain for a 
decade or so to come. Searches capable of distinguishing clearly 
among nudes, marmalades and national flags are still an unreal- 
ized dream. As time goes on, though, researchers would like to 
give the programs that collect information from the Internet the 
ability to understand what they see. 
and his colleagues at the University of 
Colorado at Boulder developed software, 
called Harvest, that lets a Web 
site compile indexing data for the pages 
it holds and ship the information on 
request to the Web sites for the various 
search engines. In so doing, Harvest’s 
automated indexing program, or gath- 
erer, can avoid having a Web crawler 
export the entire contents of a given site 
across the network. 

Crawler programs bring a copy of 
each page back to their home sites to ex- 
tract the terms that make up an index, a 
process that consumes a great deal of 
network capacity (bandwidth). The gath- 
erer, instead, sends only a file of index- 
ing terms. Moreover, it exports only in- 
formation about those pages that have 
been altered since they were last accessed, 
thus alleviating the load on the 
network and the computers tied to it. 
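
The gatherer's two savings, sending index terms rather than whole pages and sending them only for pages that have changed, can be sketched as follows. The `Gatherer` class, its method names and the integer timestamps are hypothetical stand-ins for this example and do not reflect Harvest's actual interface.

```python
import re


def index_terms(text):
    """Reduce a page to a sorted list of distinct lowercase words."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))


class Gatherer:
    """Hypothetical site-side indexer; not Harvest's real interface."""

    def __init__(self):
        self.pages = {}  # url -> (modified_time, text)

    def publish(self, url, text, modified_time):
        """Record a new or altered page held by the site."""
        self.pages[url] = (modified_time, text)

    def export_since(self, last_access):
        """Return indexing terms only for pages altered after last_access."""
        return {
            url: index_terms(text)
            for url, (modified, text) in self.pages.items()
            if modified > last_access
        }


gatherer = Gatherer()
gatherer.publish("/drilling.html", "Oil rigs and drilling equipment", modified_time=100)
gatherer.publish("/pumps.html", "Pumps for oil fields", modified_time=250)

first_visit = gatherer.export_since(last_access=0)     # everything is new
second_visit = gatherer.export_since(last_access=200)  # only the altered page

if __name__ == "__main__":
    print(second_visit)
```

The search engine receives a short list of terms per changed page, so neither the full text nor the unchanged pages cross the network.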

Gatherers might also serve a different 
function. They may give publishers a 
framework to restrict the information 
that gets exported from their Web sites. 
This degree of control is needed because 
the Web has begun to evolve beyond a 
distribution medium for free informa- 
tion. Increasingly, it facilitates access to 
proprietary information that is furnished 
for a fee. This material may not be open 
for the perusal of Web crawlers. Gath- 
erers, though, could distribute only the 
information that publishers wish to 
make available, such as links to sum- 
maries or samples of the information 
stored at a site. 

As the Net matures, the decision to 
opt for a given information collection 
method will depend mostly on users. 
For which users will it then come to resemble 
a library, with a structured approach 
to building collections? And for 
whom will it remain anarchic, with access 
supplied by automated systems? 

[Figure: HARVEST, a new search-engine architecture, in which software called 
gatherers resides at Web sites and the search engine's server asks the 
gatherers for a file of indexing terms instead of downloading all the documents.] 

Users willing to pay a fee to under- 
write the work of authors, publishers, 
indexers and reviewers can sustain the 
tradition of the library. In cases where 
information is furnished without charge 
or is advertiser supported, low-cost com- 
puter-based indexing will most likely 
dominate—the same unstructured envi- 
ronment that characterizes much of the 
contemporary Internet. Thus, social and 
economic issues, rather than technolog- 
ical ones, will exert the greatest influence 
in shaping the future of information re- 
trieval on the Internet. 
Second-Generation 


The combination of hypertext and a 
global Internet started a revolution. 


A new ingredient, XML, is 
poised to finish the job 


Give people a few hints, and 
they can figure out the rest. 
They can look at this page, 
see some large type followed by blocks 
of small type and know that they are 
looking at the start of a magazine article. 
They can look at a list of groceries and 
see shopping instructions. They can look 
at some rows of numbers and under- 
stand the state of their bank account. 

Computers, of course, are not that 
smart; they need to be told exactly what 
things are, how they are related and how 
to deal with them. Extensible Markup 
Language (XML for short) is a new lan- 
guage designed to do just that, to make 
information self-describing. This sim- 
ple-sounding change in how computers 
communicate has the potential to extend 
the Internet beyond information delivery 
to many other kinds of human activity. 
Indeed, since XML was completed in 
early 1998 by the World Wide Web Con- 
sortium (usually called the W3C), the 
standard has spread like wildfire through 
science and into industries ranging from 
manufacturing to medicine. 

The enthusiastic response is fueled by 
a hope that XML will solve some of the 
Web’s biggest problems. These are wide- 
ly known: the Internet is a speed-of-light 
network that often moves at a crawl; 
and although nearly every kind of in- 
formation is available on-line, it can be 
maddeningly difficult to find the one 
piece you need. 

Both problems arise in large part from 
the nature of the Web’s main language, 
XML BRIDGES the incompatibilities of computer 
systems, allowing people to search for and exchange 
scientific data, commercial products and multilin- 
gual documents with greater ease and speed. 
ILLUSTRATIONS BY BRUCIE ROSCH 


HTML (shorthand for Hypertext Mark- 
up Language). Although HTML is the 
most successful electronic-publishing lan- 
guage ever invented, it is superficial: in 
essence, it describes how a Web browser 
should arrange text, images and pushbuttons 
on a page. HTML's concern with 
appearances makes it relatively easy to 
learn, but it also has its costs. 

One is the difficulty in creating a Web 
site that functions as more than just a 
fancy fax machine that sends documents 
to anyone who asks. People and compa- 
nies want Web sites that take orders from 
customers, transmit medical records, 
even run factories and scientific instru- 
ments from half a world away. HTML 
was never designed for such tasks. 

So although your doctor may be able 
to pull up your drug reaction history on 
his Web browser, he cannot then e-mail 
it to a specialist and expect her to be able 
to paste the records directly into her hos- 
pital’s database. Her computer would 


<movie> 
  <title>Star Trek: Insurrection</title> 
  <star>Patrick Stewart</star> 
  <star>Brent Spiner</star> 
  <theatre> 
    <theatre-name>MondoPlex 2000</theatre-name> 
    <showtime>1415</showtime> 
    <showtime>1630</showtime> 
    <showtime>1845</showtime> 
    <showtime>2100</showtime> 
    <showtime>2315</showtime> 
    <price> 
      <adult-price>8.50</adult-price> 
      <child-price>5.00</child-price> 
    </price> 
  </theatre> 
  <theatre> 
    <theatre-name>Bigscreen 1</theatre-name> 
    <showtime>1930</showtime> 
    <price> 
      <adult-price>6.00</adult-price> 
    </price> 
  </theatre> 
</movie> 
<movie> 
<title>Shakespeare in Love</title> 


<star- 
not know what to make of the information, 
which to its eyes would be no 
more intelligible than <H1>blah blah 
</H1> <BOLD>blah blah blah </BOLD>. 
As programming legend Brian Kerni- 
ghan once noted, the problem with 
“What You See Is What You Get” is 
that what you see is all you’ve got. 
Those angle-bracketed labels in the ex- 
ample just above are called tags. HTML 
has no tag for a drug reaction, which 
highlights another of its limitations: it is 
inflexible. Adding a new tag involves a 
bureaucratic process that can take so 
long that few attempt it. And yet every 
application, not just the interchange of 
medical records, needs its own tags. 
Thus the slow pace of today’s on-line 
bookstores, mail-order catalogues and 
other interactive Web sites. Change the 
quantity or shipping method of your 
order, and to see the handful of digits 
that have changed in the total, you 
must ask a distant, overburdened server 
to send you an entirely new page, graph- 
ics and all. Meanwhile your own high- 


MARKED UP WITH XML TAGS, one file— 
containing, say, movie listings for an entire city— 
can be displayed on a wide variety of devices. 
“Stylesheets” can filter, reorder and render the 
listings as a Web page with graphics for a desktop 
computer, as a text-only list for a handheld organizer 
and even as audible speech for a telephone. 
<Together XML and XSL allow publishers to pour a publication into 
myriad forms—write once and publish everywhere. /> 


powered machine sits waiting idly, be- 
cause it has only been told about <H1>s 
and <BOLD>s, not about prices and 
shipping options. 

Thus also the dissatisfying quality of 
Web searches. Because there is no way 
to mark something as a price, it is effec- 
tively impossible to use price informa- 
tion in your searches. 


Something Old, Something New 
The solution, in theory, is very simple: 
use tags that say what the information 
is, not what it looks like. For 
example, label the parts of an order for 
a shirt not as boldface, paragraph, row 
and column—what HTML offers—but 
as price, size, quantity and color. A pro- 
gram can then recognize this document 
as a customer order and do whatever it 
needs to do: display it one way or dis- 
play it a different way or put it through a 
bookkeeping system or make a new shirt 
show up on your doorstep tomorrow. 
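
A small sketch shows how little work a program needs once the parts of an order are labeled by meaning. The tag names, the sample order and the `summarize` function are invented for this example; a real ordering language would be whatever the parties agreed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical order for a shirt, tagged by meaning rather than appearance.
ORDER = """
<order>
  <item>shirt</item>
  <size>M</size>
  <color>blue</color>
  <quantity>2</quantity>
  <price currency="USD">19.50</price>
</order>
"""


def summarize(order_xml):
    """Pull the labeled fields out of an order and compute its total."""
    order = ET.fromstring(order_xml)
    quantity = int(order.findtext("quantity"))
    price = float(order.findtext("price"))
    return {
        "item": order.findtext("item"),
        "size": order.findtext("size"),
        "color": order.findtext("color"),
        "total": quantity * price,
    }


if __name__ == "__main__":
    print(summarize(ORDER))
```

The same few lines could feed a display, a bookkeeping system or a warehouse, because the program asks for the price by name instead of hunting for boldface text.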

We, as members of a dozen-strong 
W3C working group, began crafting 
such a solution in 1996. Our idea was 
powerful but not entirely original. For 
generations, printers scribbled notes on 
manuscripts to instruct the typesetters. 
This “markup” evolved on its own until 
1986, when, after decades of work, the 
International Organization for Standard- 
ization (ISO) approved a system for the 
creation of new markup languages. 

Named Standard Generalized Mark- 
up Language, or SGML, this language 
for describing languages—a metalan- 
guage—has since proved useful in many 
large publishing applications. Indeed, 
HTML was defined using SGML. The 
only problem with SGML is that it is 
too general—full of clever features de- 
signed to minimize keystrokes in an era 
when every byte had to be accounted for. 
It is more complex than Web browsers 
can cope with. 

Our team created XML by removing 
frills from SGML to arrive at a more 
streamlined, digestible metalanguage. 
XML consists of rules that anyone can 
follow to create a markup language from 
scratch. The rules ensure that a single 
compact program, often called a parser, 
can process all these new languages. 

Consider again the doctor who wants 
to e-mail your medical record to a specialist. 
If the medical profession uses 
XML to hammer out a markup lan- 
guage for encoding medical records— 
and in fact several groups have already 
started work on this—then your doctor’s 
e-mail could contain <patient> <name> 
blah blah </name> <drug-allergy> blah 
blah blah </drug-allergy> </patient>. 
Programming any computer to recog- 
nize this standard medical notation and 
to add this vital statistic to its database 
becomes straightforward. 

Just as HTML created a way for every 
computer user to read Internet docu- 
ments, XML makes it possible, despite 
the Babel of incompatible computer 
systems, to create an Esperanto that all 
can read and write. Unlike most com- 
puter data formats, XML markup also 
makes sense to humans, because it con- 
sists of nothing more than ordinary text. 

The unifying power of XML arises 
from a few well-chosen rules. One is 
that tags almost always come in pairs. 
Like parentheses, they surround the 
text to which they apply. And like quo- 
tation marks, tag pairs can be nested in- 
side one another to multiple levels. 

The nesting rule automatically forces 
a certain simplicity on every XML 
document, which takes on the structure 
known in computer science as a tree. 
As with a genealogical tree, each graph- 
ic and bit of text in the document repre- 
sents a parent, child or sibling of some 
other element; relationships are unam- 
biguous. Trees cannot represent every 
kind of information, but they can repre- 
sent most kinds that we need comput- 
ers to understand. Trees, moreover, are 
extraordinarily convenient for pro- 
grammers. If your bank statement is in 
the form of a tree, it is a simple matter 
to write a bit of software that will re- 
order the transactions or display just 
the cleared checks. 
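
To make the bank-statement remark concrete, here is a minimal sketch that reorders transactions and picks out the cleared checks. The statement format, with its `type` and `cleared` attributes, is an assumption made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank statement; each transaction is a child of the root.
STATEMENT = """
<statement>
  <transaction type="check" cleared="yes"><date>1999-03-02</date><amount>-40.00</amount></transaction>
  <transaction type="deposit" cleared="yes"><date>1999-03-01</date><amount>500.00</amount></transaction>
  <transaction type="check" cleared="no"><date>1999-03-05</date><amount>-75.25</amount></transaction>
</statement>
"""

root = ET.fromstring(STATEMENT)
transactions = root.findall("transaction")

# Reorder the transactions by date (ISO dates sort correctly as text).
by_date = sorted(transactions, key=lambda t: t.findtext("date"))

# Display just the cleared checks.
cleared_checks = [
    t for t in transactions
    if t.get("type") == "check" and t.get("cleared") == "yes"
]

if __name__ == "__main__":
    for t in by_date:
        print(t.findtext("date"), t.get("type"), t.findtext("amount"))
```

Because every transaction is a sibling under one parent, sorting and filtering are ordinary list operations.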

Another source of XML's unifying 
strength is its reliance on a new standard 
called Unicode, a character-encoding 
system that supports intermingling of 
text in all the world’s major languages. 
In HTML, as in most word processors, 
a document is generally in one particular 
language, whether that be English or Jap- 
anese or Arabic. If your software cannot 
read the characters of that language, 
then you cannot use the document. The 
situation can be even worse: software 
made for use in Taiwan often cannot 
read mainland-Chinese texts because of 
incompatible encodings. But software 
that reads XML properly can deal with 
any combination of any of these character 
sets. Thus, XML enables exchange 
of information not only between differ- 
ent computer systems but also across 
national and cultural boundaries. 
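
A short sketch of text in several scripts living in one document follows. The `greetings` vocabulary is invented for the example, and UTF-8 is chosen here as one common encoding of Unicode.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that intermingles four scripts.
GREETINGS = """
<greetings>
  <greeting lang="en">Hello</greeting>
  <greeting lang="ja">こんにちは</greeting>
  <greeting lang="ar">مرحبا</greeting>
  <greeting lang="zh">你好</greeting>
</greetings>
"""

root = ET.fromstring(GREETINGS)
by_language = {g.get("lang"): g.text for g in root.findall("greeting")}

# Serialize to bytes and read the bytes back; no text should be lost.
encoded = ET.tostring(root, encoding="utf-8")
round_trip = ET.fromstring(encoded)

if __name__ == "__main__":
    for lang, text in by_language.items():
        print(lang, text)
```

The parser needs no per-language setting: one encoding covers every greeting in the file.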


An End to the World Wide Wait 


As XML spreads, the Web should become 
noticeably more responsive. 
At present, computing devices connect- 
ed to the Web, whether they are power- 
ful desktop computers or tiny pocket 
planners, cannot do much more than 
get a form, fill it out and then swap it 
back and forth with a Web server until 
a job is completed. But the structural 
and semantic information that can be 
added with XML allows these devices 
to do a great deal of processing on the 
spot. That not only will take a big load 
off Web servers but also should reduce 
network traffic dramatically. 

To understand why, imagine going to 
an on-line travel agency and asking for 
all the flights from London to New York 
on July 4. You would probably receive a 
list several times longer than your screen 
could display. You could shorten the list 
by fine-tuning the departure time, price 
or airline, but to do that, you would 
have to send a request across the Internet 
to the travel agency and wait for its 
answer. If, however, the long list of 
flights had been sent in XML, then the 
travel agency could have sent a small 
Java program along with the flight rec- 
ords that you could use to sort and win- 
now them in microseconds, without ever 
involving the server. Multiply this by a 
few million Web users, and the global 
efficiency gains become dramatic. 
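
The local sorting and winnowing described above might look something like this. The article speaks of a small Java program; Python is used here only for illustration, and the flight vocabulary, airline names and `winnow` function are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical flight list as it might arrive from the travel agency.
FLIGHTS = """
<flights date="07-04" from="London" to="New York">
  <flight><airline>Alpha Air</airline><departs>0915</departs><price>420.00</price></flight>
  <flight><airline>Beta Lines</airline><departs>1340</departs><price>385.00</price></flight>
  <flight><airline>Alpha Air</airline><departs>1805</departs><price>350.00</price></flight>
</flights>
"""

records = [
    {
        "airline": f.findtext("airline"),
        "departs": f.findtext("departs"),
        "price": float(f.findtext("price")),
    }
    for f in ET.fromstring(FLIGHTS).findall("flight")
]


def winnow(flights, max_price=None, airline=None, earliest=None):
    """Filter locally, then sort by price; no request goes back to the server."""
    kept = [
        f for f in flights
        if (max_price is None or f["price"] <= max_price)
        and (airline is None or f["airline"] == airline)
        and (earliest is None or f["departs"] >= earliest)
    ]
    return sorted(kept, key=lambda f: f["price"])


if __name__ == "__main__":
    for flight in winnow(records, max_price=400):
        print(flight)
```

Each change of mind about price or departure time reruns `winnow` on records the device already holds.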

As more of the information on the Net 
is labeled with industry-specific XML 
tags, it will become easier to find exactly 
what you need. Today an Internet search 
for “stockbroker jobs” will inundate 
you with advertisements but probably 
turn up few job listings—most will be 
hidden inside the classified ad services 
of newspaper Web sites, out of a search 
robot’s reach. But the Newspaper Asso- 
ciation of America is even now building 
an XML-based markup language for 
classified ads that promises to make 
such searches much more effective. 

Even that is just an intermediate step. 
Librarians figured out a long time ago 
that the way to find information in a 
hurry is to look not at the information 
itself but rather at much smaller, more 
focused sets of data that guide you to 
the useful sources: hence the library 
card catalogue. Such information about 
information is called metadata. 

From the outset, part of the XML 
project has been to create a sister stan- 
dard for metadata. The Resource De- 
scription Framework (RDF), finished 
this past February, should do for Web 
data what catalogue cards do for li- 
brary books. Deployed across the Web, 
RDF metadata will make retrieval far 
faster and more accurate than it is now. 
Because the Web has no librarians and 
every Webmaster wants, above all else, 
to be found, we expect that RDF will 
achieve a typically astonishing Internet 
growth rate once its power becomes 
apparent. 
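
As a hedged sketch of what catalogue-card metadata might look like, the example below describes two pages in RDF's XML syntax and searches the descriptions rather than the pages themselves. The page addresses, the subject labels and the use of Dublin Core elements inside RDF are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical catalogue describing two Web pages.
CATALOGUE = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/jobs/stockbroker.html">
    <dc:title>Stockbroker wanted</dc:title>
    <dc:subject>job listing</dc:subject>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/ads/broker-services.html">
    <dc:title>Discount brokerage</dc:title>
    <dc:subject>advertisement</dc:subject>
  </rdf:Description>
</rdf:RDF>
"""


def find_by_subject(rdf_xml, subject):
    """Return the addresses of resources whose dc:subject matches."""
    root = ET.fromstring(rdf_xml)
    matches = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        if description.findtext(f"{{{DC_NS}}}subject") == subject:
            matches.append(description.get(f"{{{RDF_NS}}}about"))
    return matches


if __name__ == "__main__":
    print(find_by_subject(CATALOGUE, "job listing"))
```

A search for job listings consults only the short descriptions and so skips the advertisement entirely.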

There are of course other ways to find 
things besides searching. The Web is after 
all a “hypertext,” its billions of pages 
connected by hyperlinks, those underlined 
words you click on to get whisked 
from one to the next. Hyperlinks, too, 
will do more when powered by XML. 
A standard for XML-based hypertext, 
named XLink and due later this year 
from the W3C, will allow you to choose 
from a list of multiple destinations. Oth- 
er kinds of hyperlinks will insert text or 
images right where you click, instead of 
forcing you to leave the page. 

Perhaps most useful, XLink will en- 
able authors to use indirect links that 
point to entries in some central database 
rather than to the linked pages them- 
selves. When a page’s address changes, 
the author will be able to update all the 
links that point to it by editing just one 
database record. This should help elim- 
inate the familiar “404 File Not Found” 
error that signals a broken hyperlink. 
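
The principle of an indirect link can be sketched without any XLink syntax at all. The `LinkRegistry` class and the example addresses below are hypothetical; they show only the idea of pages pointing at a database entry rather than at an address.

```python
class LinkRegistry:
    """Hypothetical central database mapping link names to addresses."""

    def __init__(self):
        self._addresses = {}

    def register(self, link_id, address):
        """Create or update the one record that holds a page's address."""
        self._addresses[link_id] = address

    def resolve(self, link_id):
        """Look up the current address behind an indirect link."""
        return self._addresses[link_id]


registry = LinkRegistry()
registry.register("annual-report", "http://example.org/reports/1998.html")

# Two pages link to the report by name, not by address.
pages = {"home": ["annual-report"], "press": ["annual-report"]}

# The report moves; only the single database record is edited.
registry.register("annual-report", "http://example.org/archive/1998-report.html")

resolved = {
    page: [registry.resolve(link_id) for link_id in links]
    for page, links in pages.items()
}

if __name__ == "__main__":
    print(resolved)
```

Neither page was touched, yet both now lead to the new address.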

The combination of more efficient 
processing, more accurate searching and 
more flexible linking will revolutionize 
the structure of the Web and make possible 
completely new ways of accessing 
information. Users will find this new 
Web faster, more powerful and more 
useful than the Web of today. 


Some Assembly Required 


Of course, it is not quite that simple. 
XML does allow anyone to design 
a new, custom-built language, but de- 
signing good languages is a challenge 
that should not be undertaken lightly. 
And the design is just the beginning: the 
meanings of your tags are not going to 
be obvious to other people unless you 
write some prose to explain them, nor 
to computers unless you write some 
software to process them. 

A moment’s thought reveals why. If 
all it took to teach a computer to handle 
a purchase order were to label it with 
<purchase-order> tags, we wouldn’t 
need XML. We wouldn’t even need pro- 
grammers—the machines would be 
smart enough to take care of themselves. 

What XML does is less magical but 
quite effective nonetheless. It lays down 
ground rules that clear away a layer of 
programming details so that people with 
similar interests can concentrate on the 
hard part—agreeing on how they want 
to represent the information they com- 
monly exchange. This is not an easy 
problem to solve, but it is not a new one, 
either. 

Such agreements will be made, be- 
cause the proliferation of incompatible 
computer systems has imposed delays, 
costs and confusion on nearly every 
area of human activity. People want to 
share ideas and do business without all 
having to use the same computers; ac- 
tivity-specific interchange languages go 
a long way toward making that possi- 
ble. Indeed, a shower of new acronyms 
ending in “ML” testifies to the inventiveness 
unleashed by XML in the sciences, 
in business and in the scholarly 
disciplines [see box on opposite page]. 
Before they can draft a new XML lan- 
guage, designers must agree on three 
things: which tags will be allowed, how 
tagged elements may nest within one 
another and how they should be pro- 
cessed. The first two—the language’s 
vocabulary and structure—are typically 
codified in a Document Type Definition, 
or DTD. The XML standard does not 
compel language designers to use DTDs, 
but most new languages will probably 
have them, because they make it much 
easier for programmers to write soft- 
ware that understands the markup and 
does intelligent things with it. 
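
A toy check gives the flavor of what a declared vocabulary and structure buy the programmer. The sketch below is not a DTD validator: it reduces nesting rules for a movie-listing language to a plain table of permitted children, ignoring order and repetition, and both the table and the `violations` function are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The vocabulary and nesting rules that a DTD declaration such as
#   <!ELEMENT movie (title, star+, theatre+)>
# would state are reduced here to "which children may appear inside which tag".
ALLOWED_CHILDREN = {
    "movie": {"title", "star", "theatre"},
    "theatre": {"theatre-name", "showtime", "price"},
    "price": {"adult-price", "child-price"},
    "title": set(),
    "star": set(),
    "theatre-name": set(),
    "showtime": set(),
    "adult-price": set(),
    "child-price": set(),
}


def violations(element):
    """List tags that are unknown or nested where the rules do not allow them."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown tag <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child))
    return problems


GOOD = (
    "<movie><title>Star Trek: Insurrection</title><star>Patrick Stewart</star>"
    "<theatre><theatre-name>Bigscreen 1</theatre-name><showtime>1930</showtime>"
    "<price><adult-price>6.00</adult-price></price></theatre></movie>"
)
BAD = "<movie><title>Untitled</title><showtime>1930</showtime></movie>"

if __name__ == "__main__":
    print(violations(ET.fromstring(GOOD)))
    print(violations(ET.fromstring(BAD)))
```

Software that can reject a misplaced tag before processing begins has far fewer odd cases to handle afterward.
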
Programmers will also need a set of 
guidelines that describe, in human language, 
what all the tags mean. HTML, 
for instance, has a DTD but also hun- 
dreds of pages of descriptive prose that 
programmers refer to when they write 
browsers and other Web software. 


A Question of Style 


For users, it is what those programs 
do, not what the descriptions say, 
that is important. In many cases, people 
will want software to display XML-en- 
coded information to human readers. 
But XML tags offer no inherent clues 
about how the information should look 
on screen or on paper. 

This is actually an advantage for pub- 
lishers, who would often like to “write 
once and publish everywhere”—to dis- 
till the substance of a publication and 
then pour it into myriad forms, both 
printed and electronic. XML lets them 
do this by tagging content to describe 
its meaning, independent of the display 
medium. Publishers can then apply rules 
organized into “stylesheets” to reformat 
the work automatically for various de- 
vices. The standard now being devel- 
oped for XML stylesheets is called the 
Extensible Stylesheet Language, or XSL. 
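
The principle of one source and many renderings can be sketched with two ordinary functions standing in for stylesheets. This is not XSL: the functions `as_html` and `as_text` are hypothetical, and the listing is a shortened form of the movie example used in this article.

```python
import xml.etree.ElementTree as ET
from html import escape

# Shortened movie listing, tagged by meaning only.
LISTING = """
<movie>
  <title>Star Trek: Insurrection</title>
  <theatre>
    <theatre-name>MondoPlex 2000</theatre-name>
    <showtime>1415</showtime>
    <showtime>1630</showtime>
  </theatre>
  <theatre>
    <theatre-name>Bigscreen 1</theatre-name>
    <showtime>1930</showtime>
  </theatre>
</movie>
"""


def showings(movie):
    """Yield (theatre name, list of showtimes) for each theatre."""
    for theatre in movie.findall("theatre"):
        times = [s.text for s in theatre.findall("showtime")]
        yield theatre.findtext("theatre-name"), times


def as_html(movie):
    """Render the listing for a desktop browser."""
    rows = "".join(
        f"<li><b>{escape(name)}</b>: {', '.join(times)}</li>"
        for name, times in showings(movie)
    )
    return f"<h1>{escape(movie.findtext('title'))}</h1><ul>{rows}</ul>"


def as_text(movie):
    """Render the same listing as plain text for a small screen."""
    lines = [movie.findtext("title").upper()]
    lines += [f"{name} {' '.join(times)}" for name, times in showings(movie)]
    return "\n".join(lines)


movie = ET.fromstring(LISTING)

if __name__ == "__main__":
    print(as_html(movie))
    print(as_text(movie))
```

The listing itself never mentions fonts or layout; each output device gets its own rules while the content is written once.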

The latest versions of several Web 
browsers can read an XML document, 
fetch the appropriate stylesheet, and use 
it to sort and format the information 
on the screen. The reader might never 
know that he is looking at XML rather 
than HTML—except that XML-based 
sites run faster and are easier to use. 

People with visual disabilities gain a 
free benefit from this approach to pub- 
lishing. Stylesheets will let them render 
XML into Braille or audible speech. The 
advantages extend to others as well: 
commuters who want to surf the Web 
in their cars may also find it handy to 
have pages read aloud. 

Although the Web has been a boon to 
science and to scholarship, it is com- 
merce (or rather the expectation of fu- 
ture commercial gain) that has fueled its 
lightning growth. The recent surge in re- 
tail sales over the Web has drawn much 
attention, but business-to-business com- 
merce is moving on-line at least as quick- 
ly. The flow of goods through the manu- 
facturing process, for example, begs for 
automation. But schemes that rely on 
complex, direct program-to-program in- 
teraction have not worked well in prac- 
tice, because they depend on a uniformi- 
ty of processing that does not exist. 

For centuries, humans have success- 
fully done business by exchanging stan- 


The Future of the Web 


New Languages for Science 


XML offers a particularly convenient way for scientists to exchange theories, 
calculations and experimental results. Mathematicians, among others, have 
long been frustrated by Web browsers’ ability to display mathematical 
expressions only as pictures. MathML now allows them to insert equations into 
their Web pages with a few lines of simple text. Readers can then paste those 
expressions directly into algebra software for calculation or graphing. 

Chemists have gone a step further, developing new browser programs for their 
XML-based Chemical Markup Language (CML) that graphically render the molec- 
ular structure of compounds described in CML Web pages. Both CML and Astron- 
omy Markup Language will help researchers sift quickly through reams of journal 
citations to find just the papers that apply to the object of their study. As- 
tronomers, for example, can enter the sky coordinates of a galaxy to pull up a list of 
images, research papers and instrument data about that heavenly body. 

XML will be helpful for running experiments as well as analyzing their results. 
National Aeronautics and Space Administration engineers began work last year on 
Astronomical Instrument ML (AIML) as a way to enable scientists on the ground 
to control the SOFIA infrared telescope as it flies on a Boeing 747. AIML should 
eventually allow astronomers all over the world to control telescopes and perhaps 
even satellites through straightforward Internet browser software. 

Geneticists may soon be using Biosequence ML (BSML) to exchange and ma- 
nipulate the flood of information produced by gene-mapping and gene-sequenc- 
ing projects. A BSML browser built and distributed free by Visual Genomics in 
Columbus, Ohio, lets researchers search through vast databases of genetic code 
and display the resulting snippets as meaningful maps and charts rather than as 
obtuse strings of letters. —The Editors 


dardized documents: purchase orders, 
invoices, manifests, receipts and so on. 
Documents work for commerce be- 
cause they do not require the parties in- 
volved to know about one another’s in- 
ternal procedures. Each record exposes 
exactly what its recipient needs to 
know and no more. The exchange of 
documents is probably the right way to 
do business on-line, too. But this was 
not the job for which HTML was built. 

XML, in contrast, was designed for 
document exchange, and it is becoming 
clear that universal electronic commerce 
will rely heavily on a flow of agree- 
ments, expressed in millions of XML 
documents pulsing around the Internet. 

Thus, for its users, the XML-powered
Web will be faster, friendlier and a
better place to do business. Web site
designers, on the other hand, will find it
more demanding. Battalions of program- 
mers will be needed to exploit new XML 
languages to their fullest. And although 
the day of the self-trained Web hacker 
is not yet over, the species is endangered. 
Tomorrow’s Web designers will need to 
be versed not just in the production of 
words and graphics but also in the con- 
struction of multilayered, interdepen- 
dent systems of DTDs, data trees, hy- 
perlink structures, metadata and style- 
sheets—a more robust infrastructure for 
the Web’s second generation.


Hypersearching the Web


With the volume of on-line information in cyberspace growing at a 
breakneck pace, more effective search tools are desperately needed. 
A new technique analyzes how Web pages are linked together 


Every day the World Wide Web
grows by roughly a million
electronic pages, adding to the
hundreds of millions already on-line.
This staggering volume of information 
is loosely held together by more than a 
billion annotated connections, called 
hyperlinks. For the first time in history, 
millions of people have virtually instant 
access from their homes and offices to 
the creative output of a significant— 
and growing—fraction of the planet’s 
population. 

But because of the Web’s rapid, chaot- 
ic growth, the resulting network of in- 
formation lacks organization and struc- 
ture. In fact, the Web has evolved into a 
global mess of previously unimagined 
proportions. Web pages can be written 
in any language, dialect or style by indi- 
viduals with any background, educa- 
tion, culture, interest and motivation. 
Each page might range from a few 
characters to a few hundred thousand, 
containing truth, falsehood, wisdom, 
propaganda or sheer nonsense. How, 
then, can one extract from this digital 
morass high-quality, relevant pages in 
response to a specific need for certain 
information? 

In the past, people have relied on 
search engines that hunt for specific 
words or terms. But such text searches 
frequently retrieve tens of thousands of 
pages, many of them useless. How can 
people quickly locate only the informa- 
tion they need and trust that it is au- 
thentic and reliable? 

We have developed a new kind of 
search engine that exploits one of the 
Web’s most valuable resources—its myr- 
iad hyperlinks. By analyzing these inter- 
connections, our system automatically 
locates two types of pages: authorities 
and hubs. The former are deemed to be 
the best sources of information on a 
particular topic; the latter are collections
of links to those locations. Our
methodology should enable users to lo- 
cate much of the information they de- 
sire quickly and efficiently. 


The Challenges of Search Engines


Computer disks have become increasingly
inexpensive, enabling the
storage of a large portion of the Web at 
a single site. At its most basic level, a 
search engine maintains a list, for every 
word, of all known Web pages contain- 
ing that word. Such a collection of lists 
is known as an index. So if people are 
interested in learning about acupunc- 
ture, they can access the “acupuncture” 
list to find all Web pages containing that 
word. 
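As a rough illustration of the index just described, the following Python sketch builds such word lists for a tiny collection. The three sample pages and the name `build_index` are invented for this example; a real engine must cope with hundreds of millions of pages and is far more elaborate.

```python
from collections import defaultdict
import re

def build_index(pages):
    """Map each word to the set of page IDs whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index

# Hypothetical three-page "Web", used only for illustration.
pages = {
    "p1": "Acupuncture clinics in Boston",
    "p2": "A history of acupuncture and herbal medicine",
    "p3": "Cheap airfares to Nepal",
}
index = build_index(pages)

# Looking up a word returns every known page that contains it.
acupuncture_pages = sorted(index["acupuncture"])
```

Answering a one-word query is then a single lookup; the hard part, as the next paragraphs explain, is deciding which of the retrieved pages to show first.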

Creating and maintaining this index is 
highly challenging [see “Searching the 
Internet,” by Clifford Lynch; SCIENTIFIC
AMERICAN, March 1997], and determining
what information to return in
response to user requests remains 
daunting. Consider the unambiguous 
query for information on “Nepal Air- 
ways,” the airline company. Of the 
roughly 100 (at the time of this writing) 
Web pages containing the phrase, how 
does a search engine decide which 20 
or so are the best? One difficulty is that 
there is no exact and mathematically 
precise measure of “best”; indeed, it lies 
in the eye of the beholder. 

Search engines such as AltaVista, Info- 
seek, HotBot, Lycos and Excite use 
heuristics to determine the way in which 
to order—and thereby prioritize—pages. 
These rules of thumb are collectively
known as a ranking function, which
must apply not only to relatively 
specific and straightforward queries 
(“Nepal Airways”) but also to much 
more general requests, such as for “air- 
craft,” a word that appears in more 
than a million Web pages. How should 
a search engine choose just 20 from 
such a staggering number? 

Simple heuristics might rank pages by 
the number of times they contain the 
query term, or they may favor instances 
in which that text appears earlier. But 
such approaches can sometimes fail 
spectacularly. Tom Wolfe’s book The 
Kandy-Kolored Tangerine-Flake Stream- 
line Baby would, if ranked by such 
heuristics, be deemed very relevant to 
the query “hernia,” because it begins 
by repeating that word dozens of times. 
Numerous extensions to these rules of 
thumb abound, including approaches 
that give more weight to words that appear
in titles, in section headings or in a
larger font.
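A toy version of such a rule of thumb shows how easily it is misled. The scoring formula below (occurrence count plus a bonus for early appearance) is an assumption made for this sketch, not the ranking function of any actual search engine, and both sample texts are invented stand-ins.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def heuristic_score(text, term):
    """Score = number of occurrences of the term, plus a bonus if it appears early."""
    words = tokenize(text)
    term = term.lower()
    count = words.count(term)
    if count == 0:
        return 0.0
    first = words.index(term)
    early_bonus = 1.0 / (1 + first)  # equals 1.0 when the term is the first word
    return count + early_bonus

# Invented sample texts: one genuinely about the topic, one that merely repeats the word.
docs = {
    "medical": "A hernia occurs when tissue pushes through a weak spot in muscle",
    "novel": "hernia hernia hernia hernia hernia said the announcer",
}
ranked = sorted(docs, key=lambda d: heuristic_score(docs[d], "hernia"), reverse=True)
```

Under this scoring the repetitive text outranks the informative one, which is exactly the kind of failure described above.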

Such strategies are routinely thwarted 
by many commercial Web sites that de- 
sign their pages in certain ways specifi- 
cally to elicit favorable rankings. Thus, 
one encounters pages whose titles are 
“cheap airfares cheap airfares cheap air- 
fares.” Some sites write other carefully 
chosen phrases many times over in col- 
ors and fonts that are invisible to hu- 
man viewers. This practice, called spam- 
ming, has become one of the main rea- 
sons why it is currently so difficult to 
maintain an effective search engine. 

WEB PAGES (white dots) are scattered over the Internet with little structure, making it
difficult for a person in the center of this electronic clutter to find only the information
desired. Although this diagram shows just hundreds of pages, the World Wide Web
currently contains more than 300 million of them. Nevertheless, an analysis of the way
in which certain pages are linked to one another can reveal a hidden order.


COPYRIGHT 2002 SCIENTIFIC AMERICAN, INC.


APRIL 2002


ALL ILLUSTRATIONS BY BRYAN CHRISTIE


Spamming aside, even the basic assumptions
of conventional text searches
are suspect. To wit, pages that are highly
relevant will not always contain the
query term, and others that do may be 
worthless. A major cause of this prob- 
lem is that human language, in all its 
richness, is awash in synonymy (differ- 
ent words having the same meaning) 
and polysemy (the same word having 
multiple meanings). Because of the for- 
mer, a query for “automobile” will miss 
a deluge of pages that lack that word 
but instead contain “car.” The latter 
manifests itself in a simple query for 
“Jaguar,” which will retrieve thousands 
of pages about the automobile, the jun- 
gle cat and the National Football 
League team, among other topics. 

One corrective strategy is to augment 
search techniques with stored informa- 
tion about semantic relations between 
words. Such compilations, typically con- 
structed by a team of linguists, are some- 
times known as semantic networks, fol- 
lowing the seminal work on the Word- 
Net project by George A. Miller and his 
colleagues at Princeton University. An 
index-based engine with access to a se- 
mantic network could, on receiving the 
query for “automobile,” first determine 
that “car” is equivalent and then re- 
trieve all Web pages containing either 
word. But this process is a double- 
edged sword: it helps with synonymy 
but can aggravate polysemy. 
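The trade-off can be seen in a few lines of Python. The synonym table here is a hand-written stand-in for a semantic network, and the three pages are invented; nothing in this sketch reflects how WordNet itself is organized or queried.

```python
# Hypothetical synonym table standing in for a semantic network.
SYNONYMS = {
    "automobile": {"car"},
    "car": {"automobile"},
}

def expand(term):
    """Return the query term together with its listed synonyms."""
    term = term.lower()
    return {term} | SYNONYMS.get(term, set())

def search(pages, term):
    """Return IDs of pages containing the term or any of its synonyms."""
    wanted = expand(term)
    return sorted(pid for pid, text in pages.items()
                  if wanted & set(text.lower().split()))

pages = {
    "a": "used car dealers",
    "b": "automobile safety ratings",
    "c": "the cable car museum",
}
hits = search(pages, "automobile")
```

Expanding "automobile" to include "car" recovers page "a", which a literal search would have missed, but it also drags in page "c", where "car" means something else entirely.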

Even as a cure for synonymy, the so- 
lution is problematic. Constructing and 
maintaining a semantic network that is 
exhaustive and cross-cultural (after all, 
the Web knows no geographical bound- 
aries) are formidable tasks. The process 
is especially difficult on the Internet, 
where a whole new language is evolv- 
ing—words such as “FAQs,” “zines” 
and “bots” have emerged, whereas oth- 
er words such as “surf” and “browse” 
have taken on additional meanings. 

Our work on the Clever project at 
IBM originated amid this perplexing ar- 
ray of issues. Early on, we realized that 
the current scheme of indexing and re- 
trieving a page based solely on the text 
it contained ignores more than a billion 
carefully placed hyperlinks that reveal 
the relations between pages. But how 
exactly should this information be used? 


FINDING authorities and hubs can be tricky because of the circular way in which they
are defined: an authority is a page that is pointed to by many hubs; a hub is a site that
links to many authorities. The process, however, can be performed mathematically.
Clever, a prototype search engine, assigns initial scores to candidate Web pages on a
particular topic. Clever then revises those numbers in repeated series of calculations,
with each iteration dependent on the values of the previous round. The computations
continue until the scores eventually settle on their final values, which can then be used
to determine the best authorities and hubs.


When people perform a search for 
“Harvard,” many of them want to 
learn more about the Ivy League 
school. But more than a million loca- 
tions contain “Harvard,” and the uni- 
versity’s home page is not the one that 
uses it the most frequently, the earliest 
or in any other way deemed especially 
significant by traditional ranking func- 
tions. No entirely internal feature of 
that home page truly seems to reveal its 
importance. 

Indeed, people design Web pages 
with all kinds of objectives in mind. For 
instance, large corporations want their 
sites to convey a certain feel and project 
a specific image—goals that might be 
very different from that of describing 
what the company does. Thus, IBM’s 
home page does not contain the word 
“computer.” For these types of situa- 
tions, conventional search techniques 
are doomed from the start. 

To address such concerns, human architects
of search engines have been
tempted to intervene. After all, they believe
they know what the appropriate
responses to certain queries should be,
and developing a ranking function that
will automatically produce those results
has been a troublesome undertaking. So
they could maintain a list of queries like
“Harvard” for which they will override
the judgment of the search engine with
predetermined “right” answers.


AUTHORITIES AND HUBS help to organize information on the Web, however informally
and inadvertently. Authorities are sites that other Web pages happen to link to
frequently on a particular topic. For the subject of human rights, for instance, the home
page of Amnesty International might be one such location. Hubs are sites that tend
to cite many of those authorities, perhaps in a resource list or in a “My Favorite Links”
section on a personal home page.

This approach is being taken by a 
number of search engines. In fact, a ser- 
vice such as Yahoo! contains only hu- 
man-selected pages. But there are 
countless possible queries. How, with a 
limited number of human experts, can 
one maintain all these lists of precom- 
puted responses, keeping them reason- 
ably complete and up-to-date, as the 
Web meanwhile grows by a million 
pages a day? 


Searching with Hyperlinks 


In our work, we have been attacking
the problem in a different way. We 
have developed an automatic technique 
for finding the most central, authorita- 
tive sites on broad search topics by 
making use of hyperlinks, one of the 
Web’s most precious resources. It is the 
hyperlinks, after all, that pull together 
the hundreds of millions of pages into a 
web of knowledge. It is through these 
connections that users browse, serendip- 
itously discovering valuable information 
through the pointers and recommenda- 
tions of people they have never met. 
Our approach rests on the assumption
that each link is an implicit
endorsement of the location to which it
points. Consider the Web site of a
human-rights activist that directs people
to the home page of Amnesty Interna- 
tional. In this case, the reference clearly 
signifies approval. 

Of course, a link may also exist purely 
for navigational purposes (“Click here 
to return to the main menu”), as a paid 
advertisement (“The vacation of your 
dreams is only a click away”) or as a 
stamp of disapproval (“Surf to this site 
to see what this fool says”). We believe, 
however, that in aggregate—that is, when 
a large enough number is considered— 
Web links do confer authority. 

In addition to expert sites that have 
garnered many recommendations, the 
Web is full of another type of page: 
hubs that link to those prestigious loca- 
tions, tacitly radiating influence out- 
ward to them. Hubs appear in guises 
ranging from professionally assembled 
lists on commercial sites to inventories 
of “My Favorite Links” on personal 
home pages. So even if we find it difficult 
to define “authorities” and “hubs” in 
isolation, we can state this much: a re- 
spected authority is a page that is re- 
ferred to by many good hubs; a useful 
hub is a location that points to many 
valuable authorities. 

These definitions look hopelessly cir- 
cular. How could they possibly lead to 
a computational method of identifying 
both authorities and hubs? Thinking of 
the problem intuitively, we devised the 
following algorithm. To start off, we 
look at a set of candidate pages about a 
particular topic, and for each one we 
make our best guess about how good a 
hub it is and how good an authority it 
is. We then use these initial estimates to 
jump-start a two-step iterative process. 

First, we use the current guesses about 
the authorities to improve the estimates 
of hubs—we locate all the best authori- 
ties, see which pages point to them and 
call those locations good hubs. Second, 
we take the updated hub information 
to refine our guesses about the authori- 
ties—we determine where the best hubs 
point most heavily and call these the 
good authorities. Repeating these steps 
several times fine-tunes the results. 

We have implemented this algorithm 
in Clever, a prototype search engine. 
For any query of a topic—say, acupunc- 
ture—Clever first obtains a list of 200 
pages from a standard text index such 
as AltaVista. The system then augments 
these by adding all pages that link to 
and from that 200. In our experience, 
the resulting collection, called the root
set, will typically contain between 1,000
and 5,000 pages. 
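The expansion step can be sketched as follows. The function name `build_root_set` and the representation of hyperlinks as (source, target) pairs are assumptions made for illustration; the article does not describe Clever's internal data structures.

```python
def build_root_set(seed_pages, links):
    """Return the seed pages plus every page that links to or from a seed.

    seed_pages: pages returned by a text index for the query.
    links: iterable of (source, target) hyperlink pairs.
    """
    seeds = set(seed_pages)
    root = set(seeds)
    for src, dst in links:
        if src in seeds:
            root.add(dst)   # a page that a seed points to
        if dst in seeds:
            root.add(src)   # a page that points to a seed
    return root

# Invented link data: "x" -> "y" touches no seed and is left out.
links = [("hub1", "s1"), ("s1", "ext1"), ("x", "y"), ("hub2", "s2")]
root = build_root_set(["s1", "s2"], links)
```

Only pages one link away from the initial results are added, which is how a starting list of 200 grows to a few thousand rather than to the whole Web.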

For each of these, Clever assigns ini- 
tial numerical hub and authority scores. 
The system then refines the values: the 
authority score of each page is updated 
to be the sum of the hub scores of other 
locations that point to it; a hub score is 
revised to be the sum of the authority 
scores of locations to which a page 
points. In other words, a page that has 
many high-scoring hubs pointing to it 
earns a higher authority score; a loca- 
tion that points to many high-scoring 
authorities garners a higher hub score. 
Clever repeats these calculations until 
the scores have more or less settled on 
their final values, from which the best 
authorities and hubs can be deter- 
mined. (Note that the computations do 
not preclude a particular page from 
achieving a top rank in both categories, 
as sometimes occurs.) 
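The two update rules translate almost directly into code. The sketch below is a minimal illustration rather than Clever's implementation: the rescaling of the scores after each round is an added assumption (it keeps the numbers from growing without bound and does not change the ordering), and the link data are invented.

```python
import math

def normalize(scores):
    """Rescale scores to unit length (an assumption; see the lead-in)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hub_authority_scores(links, rounds=20):
    """links: list of (source, target) pairs within the root set."""
    pages = {p for edge in links for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        # Authority score: sum of the hub scores of pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hub_authority_scores(links)
best_authority = max(auth, key=auth.get)
```

In this small example the page with the most hub endorsements emerges as the top authority, and the hubs that cite both authorities outscore the one that cites only a single authority.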

The algorithm might best be under- 
stood in visual terms. Picture the Web 
as a vast network of innumerable sites, 
all interconnected in a seemingly ran- 
dom fashion. For a given set of pages 
containing a certain word or term, 
Clever zeroes in on the densest pattern 
of links between those pages. 

As it turns out, the iterative summa- 
tion of hub and authority scores can be 
analyzed with stringent mathematics. 
Using linear algebra, we can represent 
the process as the repeated multiplica- 
tion of a vector (specifically, a row of 
numbers representing the hub or au- 
thority scores) by a matrix (a two-di- 
mensional array of numbers represent- 
ing the hyperlink structure of the root 
set). The final results of the process are 
hub and authority vectors that have 
equilibrated to certain numbers—values 
that reveal which pages are the best 
hubs and authorities, respectively. (In 
the world of linear algebra, such a stabi- 
lized row of numbers is called an eigen- 
vector; it can be thought of as the solu- 
tion to a system of equations defined by 
the matrix.) 
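In symbols, the process can be written as follows. The notation is introduced here for illustration and does not appear in the article: $A$ is the matrix with $A_{ij} = 1$ if page $i$ links to page $j$ and $0$ otherwise, and $a^{(k)}$ and $h^{(k)}$ are the authority and hub scores after round $k$, written as column vectors.

```latex
\begin{align*}
a^{(k+1)} &= A^{\mathsf{T}} h^{(k)}, &
h^{(k+1)} &= A\, a^{(k+1)}, \\
\intertext{so that, substituting one rule into the other,}
a^{(k+1)} &= A^{\mathsf{T}} A\, a^{(k)}, &
h^{(k+1)} &= A A^{\mathsf{T}} h^{(k)} .
\end{align*}
```

If the vectors are rescaled after each round, repeated multiplication of this kind drives them toward vectors $a^{*}$ and $h^{*}$ satisfying $A^{\mathsf{T}} A\, a^{*} = \lambda a^{*}$ and $A A^{\mathsf{T}} h^{*} = \lambda h^{*}$, where $\lambda$ is the largest eigenvalue; these are the eigenvectors referred to above.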

With further linear algebraic analysis, 
we have shown that the iterative pro- 
cess will rapidly settle to a relatively 
steady set of hub and authority scores. 
For our purposes, a root set of 3,000 
pages requires about five rounds of
calculations. Furthermore, the results are
generally independent of the initial esti- 
mates of scores used to start the pro- 
cess. The method will work even if the 
values are all initially set to be equal to 1. 
So the final hub and authority scores are 
intrinsic to the collection of pages in the 
root set. 

A useful by-product of Clever’s itera- 
tive processing is that the algorithm nat- 
urally separates Web sites into clusters. 
A search for information on abortion, 
for example, results in two types of lo- 
cations, pro-life and pro-choice, because 
pages from one group are more likely to 
link to one another than to those from 
the other community. 

From a larger perspective, Clever’s al- 
gorithm reveals the underlying structure 
of the World Wide Web. Although the 
Internet has grown in a hectic, willy- 
nilly fashion, it does indeed have an in- 
herent—albeit inchoate—order based on 
how pages are linked. 


The Link to Citation Analysis 


Methodologically, the Clever algorithm
has close ties to citation
analysis, the study of patterns of how 
scientific papers make reference to one 
another. Perhaps the field’s best-known 
measure of a journal’s importance is the 
“impact factor.” Developed by Eugene 
Garfield, a noted information scientist 
and founder of Science Citation Index, 
the metric essentially judges a publication 
by the number of citations it receives. 
On the Web, the impact factor would 
correspond to the ranking of a page sim- 
ply by a tally of the number of links that 
point to it. But this approach is typically 
not appropriate, because it can favor 
universally popular locations, such as 
the home page of the New York Times, 
regardless of the specific query topic. 
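A plain tally of incoming links takes only a few lines, which also makes its weakness easy to see. The link data below are invented.

```python
from collections import Counter

def inlink_counts(links):
    """Count how many links point to each page."""
    return Counter(dst for _, dst in links)

# Invented links: a general news site is cited by everyone,
# a topical page by only one source.
links = [("a", "news"), ("b", "news"), ("c", "news"), ("a", "acupuncture-faq")]
counts = inlink_counts(links)
top = counts.most_common(1)[0][0]
```

The universally popular page comes out on top whatever the query, even when the topical page is the more useful answer.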
Even in the area of citation analysis, 
researchers have attempted to improve 
Garfield’s measure, which counts each 
reference equally. Would not a better 
strategy give additional weight to cita- 
tions from a journal deemed more im- 
portant? Of course, the difficulty with 
this approach is that it leads to a circu- 
lar definition of “importance,” similar 
to the problem we encountered in specifying
hubs and authorities. As early as
1976 Gabriel Pinski and Francis Narin
of CHI Research in Haddon Heights,
N.J., overcame this hurdle by developing
an iterated method for computing a
stable set of adjusted scores, which they
termed influence weights. In contrast to
our work, Pinski and Narin did not invoke
a distinction between authorities
and hubs. Their method essentially passes
weight directly from one good authority
to another.


CYBERCOMMUNITIES (shown in different colors) populate the Web. An exploration
of this phenomenon has uncovered various groups on topics as arcane as oil spills off the
coast of Japan, fire brigades in Australia and resources for Turks living in the U.S. The
Web is filled with hundreds of thousands of such finely focused communities.

This difference raises a fundamental 
point about the Web versus traditional 
printed scientific literature. In cyber- 
space, competing authorities (for exam- 
ple, Netscape and Microsoft on the 
topic of browsers) frequently do not ac- 
knowledge one another’s existence, so 
they can be connected only by an inter- 
mediate layer of hubs. Rival prominent 
scientific journals, on the other hand, 
typically do a fair amount of cross-cita- 
tion, making the role of hubs much less 
crucial. 

A number of groups are also investi- 
gating the power of hyperlinks for 
searching the Web. Sergey Brin and 
Lawrence Page of Stanford University, 
for instance, have developed a search 
engine dubbed Google that implements 
a link-based ranking measure related to 
the influence weights of Pinski and Nar- 
in. The Stanford scientists base their ap- 
proach on a model of a Web surfer who 
follows links and makes occasional hap- 
hazard jumps, arriving at certain places 
more frequently than others. Thus, 
Google finds a single type of universally 
important page—intuitively, locations 
that are heavily visited in a random 
traversal of the Web’s link structure. In 
practice, for each Web page Google basically
sums the scores of other locations
pointing to it. So, when presented
with a specific query, Google can re- 
spond by quickly retrieving all pages con- 
taining the search text and listing them 
according to their preordained ranks. 
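The random-surfer model can be approximated with a short iteration. This is a sketch of the general idea only, not Google's actual code: the jump probability of 0.15, the number of rounds and the treatment of pages without outgoing links are all assumptions chosen for the example.

```python
def random_surfer_rank(links, jump=0.15, rounds=50):
    """Estimate how often a surfer who follows links, and occasionally
    jumps to a random page, would visit each page."""
    pages = sorted({p for edge in links for p in edge})
    n = len(pages)
    out = {p: [] for p in pages}
    for src, dst in links:
        out[src].append(dst)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        new = {p: jump / n for p in pages}      # share from haphazard jumps
        for p in pages:
            targets = out[p] or pages           # assumption: a dead end jumps anywhere
            share = (1.0 - jump) * rank[p] / len(targets)
            for t in targets:
                new[t] += share                 # share passed along each link
        rank = new
    return rank

links = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
rank = random_surfer_rank(links)
most_visited = max(rank, key=rank.get)
```

Because the ranks depend only on the link structure, they can be computed once in advance and reused for every query, which is the source of the speed advantage discussed next.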

Google and Clever have two main dif- 
ferences. First, the former assigns initial 
rankings and retains them independently 
of any queries, whereas the latter assem- 
bles a different root set for each search 
term and then prioritizes those pages in 
the context of that particular query. Con- 
sequently, Google’s approach enables 
faster response. Second, Google’s basic 
philosophy is to look only in the forward 
direction, from link to link. In contrast, 
Clever also looks backward from an au- 
thoritative page to see what locations 
are pointing there. In this sense, Clever 
takes advantage of the sociological phe- 
nomenon that humans are innately moti- 
vated to create hublike content express- 
ing their expertise on specific topics. 


The Search Continues 


We are exploring a number of ways to
enhance Clever. A fundamental direction
in our overall approach is the integration
of text and hyperlinks. One strategy is
to view certain
links as carrying more weight than oth- 
ers, based on the relevance of the text in 
the referring Web location. Specifically, 
we can analyze the contents of the 
pages in the root set for the occurrences 
and relative positions of the query topic 
and use this information to assign nu- 
merical weights to some of the connec- 
tions between those pages. If the query 
text appeared frequently and close to a 
link, for instance, the corresponding 
weight would be increased. 
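One simple way to realize such a weighting is to count how often the query term occurs within a few words of the link. The window size, the boost factor and the formula itself are assumptions made for this sketch; the article does not specify how Clever computes its weights.

```python
def link_weight(page_words, link_position, query, window=5, boost=1.0):
    """Weight = 1 + boost * (occurrences of the query within `window` words of the link)."""
    lo = max(0, link_position - window)
    hi = link_position + window + 1
    nearby = page_words[lo:hi]
    matches = sum(1 for w in nearby if w.lower() == query.lower())
    return 1.0 + boost * matches

# Invented page text; LINK and LINK2 mark where two hyperlinks sit.
words = ("best acupuncture resources see this acupuncture guide LINK "
         "and also my cat photos LINK2").split()
w_topic = link_weight(words, words.index("LINK"), "acupuncture")
w_other = link_weight(words, words.index("LINK2"), "acupuncture")
```

In the iterative calculation, each link would then contribute its weight times the relevant hub or authority score instead of the bare score, so that links surrounded by on-topic text count for more.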


Our preliminary experiments suggest 
that this refinement substantially in- 
creases the focus of the search results. 
(A shortcoming of Clever has been that 
for a narrow topic, such as Frank Lloyd 
Wright’s house Fallingwater, the system 
sometimes broadens its search and re- 
trieves information on a general subject, 
such as American architecture.) We are 
investigating other improvements, and 
given the many styles of authorship on 
the Web, the weighting of links might 
incorporate page content in a variety of 
ways. 

We have also begun to construct lists 
of Web resources, similar to the guides 
put together manually by employees of 
companies such as Yahoo! and Info- 
seek. Our early results indicate that au- 
tomatically compiled lists can be com- 
petitive with handcrafted ones. Further- 
more, through this work we have found 
that the Web teems with tightly knit 
groups of people, many with offbeat com- 
mon interests (such as weekend sumo en- 
thusiasts who don bulky plastic outfits 
and wrestle each other for fun), and we 
are currently investigating efficient and 
automatic methods for uncovering these 
hidden communities. 

The World Wide Web of today is dra- 
matically different from that of just five 
years ago. Predicting what it will be like 
in another five years seems futile. Will 
even the basic act of indexing the Web 
soon become infeasible? And if so, will 
our notion of searching the Web undergo 
fundamental changes? For now, the one 
thing we feel certain in saying is that the 
Web’s relentless growth will continue to 
generate computational challenges for 
wading through the ever increasing volume
of on-line information.


THE ENTERTAINMENT SYSTEM was
belting out the Beatles’ “We Can Work It 
Out” when the phone rang. When Pete 
answered, his phone turned the sound 
down by sending a message to all the oth- 
er local devices that had a volume control. 
His sister, Lucy, was on the line from the 
doctor’s office: “Mom needs to see a spe- 
cialist and then has to have a series of 
physical therapy sessions. Biweekly or 
something. I’m going to have my agent set 
up the appointments.” Pete immediately 
agreed to share the chauffeuring. 

At the doctor’s office, Lucy instruct- 
ed her Semantic Web agent through her 
handheld Web browser. The agent 
promptly retrieved information about 
Mom’s prescribed treatment from the 
doctor’s agent, looked up several lists of 
providers, and checked for the ones 
in-plan for Mom’s insurance within a 20- 
mile radius of her home and with a rat- 
ing of excellent or very good on trusted 
rating services. It then began trying to find 
a match between available appointment 
times (supplied by the agents of individ- 
ual providers through their Web sites) and 
Pete’s and Lucy’s busy schedules. (The em- 
phasized keywords indicate terms whose 
semantics, or meaning, were defined for 
the agent through the Semantic Web.) 

In a few minutes the agent presented 
them with a plan. Pete didn’t like it—Uni- 
versity Hospital was all the way across 
town from Mom’s place, and he’d be dri- 
ving back in the middle of rush hour. He 
set his own agent to redo the search with 
stricter preferences about location and
time. Lucy’s agent, having complete
trust in Pete’s agent in the context of the
present task, automatically assisted by 
supplying access certificates and shortcuts 
to the data it had already sorted through. 

Almost instantly the new plan was 
presented: a much closer clinic and earli- 
er times—but there were two warning 
notes. First, Pete would have to reschedule 
a couple of his less important appointments.
He checked what they were—not a
problem. The other was something about 
the insurance company’s list failing to in- 
clude this provider under physical ther- 
apists: “Service type and insurance plan 
status securely verified by other means,” 
the agent reassured him. “(Details?)” 

Lucy registered her assent at about the 
same moment Pete was muttering, “Spare 
me the details,” and it was all set. (Of 
course, Pete couldn’t resist the details and 
later that night had his agent explain how 
it had found that provider even though it 
wasn’t on the proper list.) 


Expressing Meaning 

PETE AND Lucy could use their agents to 
carry out all these tasks thanks not to the 
World Wide Web of today but rather the 
Semantic Web that it will evolve into to- 
morrow. Most of the Web’s content to- 
day is designed for humans to read, not 
for computer programs to manipulate 
meaningfully. Computers can adeptly 
parse Web pages for layout and routine 
processing—here a header, there a link to 
another page—but in general, computers 
have no reliable way to process the
semantics: this is the home page of the
Hartman and Strauss Physio Clinic, this link
goes to Dr. Hartman’s curriculum vitae.
The Semantic Web will bring structure to 
the meaningful content of Web pages, 
creating an environment where software 
agents roaming from page to page can 
readily carry out sophisticated tasks for 
users. Such an agent coming to the clinic’s 
Web page will know not just that the page 
has keywords such as “treatment, medi- 
cine, physical, therapy” (as might be en- 
coded today) but also that Dr. Hartman 
works at this clinic on Mondays, 
Wednesdays and Fridays and that the 
script takes a date range in yyyy-mm- 
dd format and returns appointment 
times. And it will “know” all this with- 
out needing artificial intelligence on the 
scale of 2001’s Hal or Star Wars’s C- 
3PO. Instead these semantics were en- 
coded into the Web page when the clinic’s 
office manager (who never took Comp 
Sci 101) massaged it into shape using off- 
the-shelf software for writing Semantic 
Web pages along with resources listed on 
the Physical Therapy Association’s site. 
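To give a feel for what "encoded semantics" might look like to a program, the statements about the clinic can be written as simple subject-predicate-object facts. Every identifier below is hypothetical; no particular vocabulary or file format is implied.

```python
# Hypothetical identifiers, invented for illustration only.
triples = [
    ("DrHartman", "worksAt", "HartmanStraussClinic"),
    ("DrHartman", "availableOn", "Monday"),
    ("DrHartman", "availableOn", "Wednesday"),
    ("DrHartman", "availableOn", "Friday"),
    ("AppointmentScript", "acceptsDateFormat", "yyyy-mm-dd"),
]

def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

days = objects("DrHartman", "availableOn")
```

An agent that can query facts of this kind needs no understanding of the page's layout or prose in order to learn when the doctor is in.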
The Semantic Web is not a separate 
Web but an extension of the current one, 
in which information is given well-defined 
meaning, better enabling computers and 
people to work in cooperation. The first 
steps in weaving the Semantic Web into 
the structure of the existing Web are al- 
ready under way. In the near future, these 
developments will usher in significant 
new functionality as machines become 
much better able to process and “under- 
stand” the data that they merely display 
at present.

The essential property of the
World Wide Web is its universality. The
power of a hypertext link is that “anything
can link to anything.” Web technology,
therefore, must not discriminate
between the scribbled draft and the pol- 
ished performance, between commercial 
and academic information, or among cul- 
tures, languages, media and so on. Infor- 
mation varies along many axes. One of 
these is the difference between informa- 
tion produced primarily for human con- 
sumption and that produced mainly for 
machines. At one end of the scale we have 
everything from the five-second TV com- 
mercial to poetry. At the other end we 
have databases, programs and sensor out- 
put. To date, the Web has developed most 
rapidly as a medium of documents for 
people rather than for data and informa- 
tion that can be processed automatically. 
The Semantic Web aims to make up for 
this. 

Like the Internet, the Semantic Web 
will be as decentralized as possible. Such 
Web-like systems generate a lot of excite- 
ment at every level, from major corpora- 
tion to individual user, and provide bene- 
fits that are hard or impossible to predict 
in advance. Decentralization requires 
compromises: the Web had to throw away 
the ideal of total consistency of all of its in- 
terconnections, ushering in the infamous 
message “Error 404: Not Found” but al- 
lowing unchecked exponential growth. 


Knowledge Representation 
FOR THE SEMANTIC WEB to function, 
computers must have access to structured 
collections of information and sets of in- 
ference rules that they can use to conduct 
automated reasoning. Artificial-intelli- 
gence researchers have studied such sys- 
tems since long before the Web was de- 
veloped. Knowledge representation, as 
this technology is often called, is current- 
ly in a state comparable to that of hyper- 
text before the advent of the Web: it is 
clearly a good idea, and some very nice 
demonstrations exist, but it has not yet 
changed the world. It contains the seeds 
of important applications, but to realize 
its full potential it must be linked into a 
single global system. 

Traditional knowledge-representation
systems typically have been centralized,
requiring everyone to share exactly
the same definition of common concepts
such as “parent” or “vehicle.” But central
control is stifling, and increasing the size
and scope of such a system rapidly
becomes unmanageable.

Moreover, these systems usually care- 
fully limit the questions that can be asked 
so that the computer can answer reliably— 
or answer at all. The problem is reminiscent
of Gödel’s theorem from mathematics:
any system that is complex enough to
be useful also encompasses unanswerable 
questions, much like sophisticated ver- 
sions of the basic paradox “This sentence 
is false.” To avoid such problems, tradi- 
tional knowledge-representation systems 
generally each had their own narrow and 
idiosyncratic set of rules for making infer- 
ences about their data. For example, a ge- 
nealogy system, acting on a database of 
family trees, might include the rule “a wife 
of an uncle is an aunt.” Even if the data 
could be transferred from one system to 
another, the rules, existing in a complete- 
ly different form, usually could not. 

Semantic Web researchers, in contrast, 
accept that paradoxes and unanswerable 
questions are a price that must be paid to 
achieve versatility. We make the language 
for the rules as expressive as needed to al- 
low the Web to reason as widely as de- 
sired. This philosophy is similar to that of 
the conventional Web: early in the Web’s 
development, detractors pointed out that 
it could never be a well-organized library; 
without a central database and tree struc- 
ture, one would never be sure of finding 
everything. They were right. But the ex- 
pressive power of the system made vast 
amounts of information available, and 
search engines (which would have seemed 
quite impractical a decade ago) now pro- 
duce remarkably complete indices of a lot 
of the material out there. 

The challenge of the Semantic Web, 
therefore, is to provide a language that 
expresses both data and rules for reason- 
ing about the data and that allows rules 
from any existing knowledge-representa- 
tion system to be exported onto the Web. 

Adding logic to the Web (the means
to use rules to make inferences, choose
courses of action and answer questions)
is the task before the Semantic Web
community at the moment. A mixture of
mathematical and engineering decisions
complicates this task. The logic must be
powerful enough to describe complex
properties of objects but not so powerful
that agents can be tricked by being
asked to consider a paradox. Fortunately,
a large majority of the information we
want to express is along the lines of “a
hex-head bolt is a type of machine bolt,”
which is readily written in existing
languages with a little extra vocabulary.

Two important technologies for de- 
veloping the Semantic Web are already in 
place: eXtensible Markup Language 
(XML) and the Resource Description 
Framework (RDF). XML lets everyone 
create their own tags—hidden labels such 
as <zip code> or <alma mater> that annotate
Web pages or sections of text on a
page. Scripts, or programs, can make use 
of these tags in sophisticated ways, but 
the script writer has to know what the 
page writer uses each tag for. In short, 
XML allows users to add arbitrary struc- 
ture to their documents but says nothing 
about what the structures mean [see 
“XML and the Second-Generation Web,” 
by Jon Bosak and Tim Bray; SCIENTIFIC 
AMERICAN, May 1999]. 

Meaning is expressed by RDF, which 
encodes it in sets of triples, each triple be- 
ing rather like the subject, verb and object 
of an elementary sentence. These triples 
can be written using XML tags. In RDF, 
a document makes assertions that partic- 
ular things (people, Web pages or what- 
ever) have properties (such as “is a sister 
of,” “is the author of”) with certain val- 
ues (another person, another Web page). 
This structure turns out to be a natural 
way to describe the vast majority of the 
data processed by machines. Subject and 
object are each identified by a Universal 
Resource Identifier (URI), just as used in 
a link on a Web page. (URLs, Uniform 
Resource Locators, are the most common 
type of URI.) The verbs are also identified 
by URIs, which enables anyone to define 
a new concept, a new verb, just by defin- 
ing a URI for it somewhere on the Web. 
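
The subject, verb and object structure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RDF syntax: the `example.org` URIs, the `Triple` type and the `objects_of` helper are all invented for the example.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the thing being described
    predicate: str  # URI of the property, the "verb"
    obj: str        # URI of the value the property points to


# Hypothetical URIs, invented for this sketch.
EX = "http://example.org/"
IS_AUTHOR_OF = EX + "terms/isAuthorOf"
IS_SISTER_OF = EX + "terms/isSisterOf"

triples = [
    Triple(EX + "people/alice", IS_AUTHOR_OF, EX + "pages/report.html"),
    Triple(EX + "people/alice", IS_SISTER_OF, EX + "people/carol"),
]


def objects_of(subject: str, predicate: str, store: List[Triple]) -> List[str]:
    """Return every value asserted for the given subject and property."""
    return [t.obj for t in store
            if t.subject == subject and t.predicate == predicate]


print(objects_of(EX + "people/alice", IS_AUTHOR_OF, triples))
```

Because the verb is itself a URI, adding a new kind of statement requires no change to the data structure, only a new URI.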

HTML: Hypertext Markup Language. The language used to encode formatting,
links and other features on Web pages. Uses standardized “tags” such as <H1> and
<BODY> whose meaning and interpretation is set universally by the World Wide
Web Consortium.

XML: eXtensible Markup Language. A markup language like HTML that lets
individuals define and use their own tags. XML has no built-in mechanism to convey
the meaning of the user’s new tags to other users.

RESOURCE: Web jargon for any entity. Includes Web pages, parts of
a Web page, devices, people and more.

URL: Uniform Resource Locator. The familiar codes (such as
http://www.sciam.com/index.html) that are used in hyperlinks.

URI: Universal Resource Identifier. URLs are the most familiar type of URI. A URI
defines or specifies an entity, not necessarily by naming its location on the Web.

RDF: Resource Description Framework. A scheme for defining information on the Web.
RDF provides the technology for expressing the meaning of terms and concepts in a
form that computers can readily process. RDF can use XML for its syntax and URIs to
specify entities, concepts, properties and relations.

ONTOLOGIES: Collections of statements written in a language such as RDF that
define the relations between concepts and specify logical rules for reasoning
about them. Computers will “understand” the meaning of semantic data on a Web
page by following links to specified ontologies.

AGENT: A piece of software that runs without direct human control or constant
supervision to accomplish goals provided by a user. Agents typically collect, filter and
process information found on the Web, sometimes with the help of other agents.

SERVICE DISCOVERY: The process of locating an agent or automated Web-based
service that will perform a required function. Semantics will enable agents to describe
to one another precisely what function they carry out and what input data are needed.

Human language thrives when using
the same term to mean somewhat different
things, but automation does not.
Imagine that I hire a clown messenger ser- 
vice to deliver balloons to my customers 
on their birthdays. Unfortunately, the 
service transfers the addresses from my 
database to its database, not knowing 
that the “addresses” in mine are where 
bills are sent and that many of them are 
post office boxes. My hired clowns end 
up entertaining a number of postal work- 
ers—not necessarily a bad thing but cer- 
tainly not the intended effect. Using a dif- 
ferent URI for each specific concept solves 
that problem. An address that is a mailing 
address can be distinguished from one that 
is a street address, and both can be distin- 
guished from an address that is a speech. 

The triples of RDF form webs of in- 
formation about related things. Because 
RDF uses URIs to encode this informa- 
tion in a document, the URIs ensure that 
concepts are not just words in a docu- 
ment but are tied to a unique definition 
that everyone can find on the Web. For 
example, imagine that we have access to
a variety of databases with information 
about people, including their addresses. 
If we want to find people living in a spe- 
cific zip code, we need to know which 
fields in each database represent names 
and which represent zip codes. RDF can 
specify that “(field 5 in database A) (is a 
field of type) (zip code),” using URIs 
rather than phrases for each term. 
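
A rough Python sketch of the zip-code lookup follows. It assumes two toy databases whose columns are described by “(field) (is a field of type) (concept)” triples. Every URI, the `#fieldN` naming scheme and the sample rows are made up for illustration.

```python
# Hypothetical URIs standing in for published definitions.
ZIP_CODE = "http://example.org/terms/zipCode"
NAME = "http://example.org/terms/personName"
IS_FIELD_OF_TYPE = "http://example.org/terms/isFieldOfType"

DB_A = "http://example.org/dbA"
DB_B = "http://example.org/dbB"

# Triples of the form (field URI, is-field-of-type, concept URI).
schema_triples = [
    (DB_A + "#field0", IS_FIELD_OF_TYPE, NAME),
    (DB_A + "#field5", IS_FIELD_OF_TYPE, ZIP_CODE),
    (DB_B + "#field2", IS_FIELD_OF_TYPE, NAME),
    (DB_B + "#field1", IS_FIELD_OF_TYPE, ZIP_CODE),
]

# Toy rows; the two databases lay out their columns differently.
databases = {
    DB_A: [["Ann", "", "", "", "", "14850"],
           ["Raj", "", "", "", "", "10001"]],
    DB_B: [["unused", "14850", "Lee"]],
}


def column_for(db_uri, concept_uri):
    """Find which column of a database holds the given concept."""
    prefix = db_uri + "#field"
    for subject, predicate, obj in schema_triples:
        if (predicate == IS_FIELD_OF_TYPE and obj == concept_uri
                and subject.startswith(prefix)):
            return int(subject[len(prefix):])
    raise LookupError("no field of type %s in %s" % (concept_uri, db_uri))


def people_in_zip(zip_code):
    """Collect names from every database, whatever its column layout."""
    found = []
    for db_uri, rows in databases.items():
        name_col = column_for(db_uri, NAME)
        zip_col = column_for(db_uri, ZIP_CODE)
        found.extend(row[name_col] for row in rows if row[zip_col] == zip_code)
    return found


print(sorted(people_in_zip("14850")))
```

The query code never hard-codes a column number; it learns the layout from the triples, which is the point of describing the fields with URIs.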


Ontologies 
OF COURSE, THIS IS NOT the end of the 
story, because two databases may use 
different identifiers for what is in fact the 
same concept, such as Zip code. A pro- 
gram that wants to compare or combine 
information across the two databases has 
to know that these two terms are being 
used to mean the same thing. Ideally, the 
program must have a way to discover 
such common meanings for whatever 
databases it encounters. 

A solution to this problem is provided
by the third basic component of the
Semantic Web, collections of informa- 
tion called ontologies. In philosophy, an 
ontology is a theory about the nature of 
existence, of what types of things exist; 
ontology as a discipline studies such the- 
ories. Artificial-intelligence and Web re- 
searchers have co-opted the term for their 
own jargon, and for them an ontology is 
a document or file that formally defines 
the relations among terms. The most typ- 
ical kind of ontology for the Web has a 
taxonomy and a set of inference rules. 

The taxonomy defines classes of ob- 
jects and relations among them. For ex- 
ample, an address may be defined as a 
type of location, and city codes may be 
defined to apply only to locations, and 
so on. Classes, subclasses and relations 
among entities are a very powerful tool 
for Web use. We can express a large 
number of relations among entities by as- 
signing properties to classes and allowing 
subclasses to inherit such properties. If 
city codes must be of type city and 
cities generally have Web sites, we can 
discuss the Web site associated with a 
city code even if no database links a city 
code directly to a Web site. 
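
As a minimal sketch of how subclasses can inherit properties, consider the following Python fragment. The class names, property names, the city code “ITH” and the Web address are assumptions chosen for the example, not part of any published ontology.

```python
# child class -> parent class (hypothetical taxonomy)
SUBCLASS_OF = {"Address": "Location", "MailingAddress": "Address"}

# properties declared directly on each class
CLASS_PROPERTIES = {
    "Location": {"cityCode"},
    "Address": {"street"},
    "MailingAddress": {"postOfficeBox"},
}


def properties_of(cls):
    """Collect a class's own properties plus those of all its ancestors."""
    props = set()
    while cls is not None:
        props |= CLASS_PROPERTIES.get(cls, set())
        cls = SUBCLASS_OF.get(cls)
    return props


# A city code denotes a city, and cities have Web sites, so a Web site
# can be reached from a city code without any direct link between them.
CITY_FOR_CODE = {"ITH": "city:Ithaca"}
WEB_SITE_OF = {"city:Ithaca": "http://example.org/ithaca"}


def web_site_for_city_code(code):
    return WEB_SITE_OF.get(CITY_FOR_CODE.get(code))


print(sorted(properties_of("MailingAddress")))
print(web_site_for_city_code("ITH"))
```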

Inference rules in ontologies supply 
further power. An ontology may express 
the rule “If a city code is associated with 
a state code, and an address uses that city 
code, then that address has the associated 
state code.” A program could then read- 
ily deduce, for instance, that a Cornell 
University address, being in Ithaca, must 
be in New York State, which is in the 
U.S., and therefore should be formatted 
to U.S. standards. The computer doesn’t 
truly “understand” any of this informa- 
tion, but it can now manipulate the terms 
much more effectively in ways that are 
useful and meaningful to the human user. 
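
The Cornell deduction can be imitated with a tiny forward-chaining loop. This is a sketch under simplifying assumptions: the two rules are hard-coded rather than read from an ontology, and the identifiers such as `addr:Cornell` and `usesCityCode` are invented.

```python
facts = {
    ("city:Ithaca", "inState", "state:NY"),
    ("state:NY", "inCountry", "country:US"),
    ("addr:Cornell", "usesCityCode", "city:Ithaca"),
}


def infer(facts):
    """Apply the two rules repeatedly until no new fact appears."""
    derived = set(facts)
    while True:
        new = set()
        for (a, p1, mid) in derived:
            for (mid2, p2, target) in derived:
                if mid != mid2:
                    continue
                # Rule 1: an address using a city code gets that city's state.
                if p1 == "usesCityCode" and p2 == "inState":
                    new.add((a, "hasStateCode", target))
                # Rule 2: an address in a state is in that state's country.
                elif p1 == "hasStateCode" and p2 == "inCountry":
                    new.add((a, "inCountry", target))
        if new <= derived:
            return derived
        derived |= new


closure = infer(facts)
print(("addr:Cornell", "inCountry", "country:US") in closure)
```

The program manipulates symbols only; nothing in it “understands” what a state is, which matches the caveat in the text.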

With ontology pages on the Web, so- 
lutions to terminology (and other) prob- 
lems begin to emerge. The meaning of 
terms or XML codes used on a Web page 
can be defined by pointers from the page 
to an ontology. Of course, the same prob- 
lems as before now arise if I point to an 
ontology that defines addresses as con- 
taining a Zip code and you point to one 
that uses postal code. This kind of confusion
can be resolved if ontologies (or
other Web services) provide equivalence 
relations: one or both of our ontologies 
may contain the information that my Zip 
code is equivalent to your postal code. 
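
One way to picture an equivalence relation is as a table that maps each term to a shared canonical term. The sketch below assumes two hypothetical ontology URIs and uses a small union-find structure. It is an illustration, not how any particular ontology language records equivalence.

```python
ZIP_A = "http://example.org/ontoA#zipCode"
POSTAL_B = "http://example.org/ontoB#postalCode"

# Equivalence statements that either ontology might publish.
EQUIVALENT = [(ZIP_A, POSTAL_B)]


def build_canonical(pairs):
    """Return a function mapping any term to its equivalence-class representative."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            term = parent[term]
        return term

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return find


canonical = build_canonical(EQUIVALENT)


def normalize(record):
    """Rewrite a record's keys to canonical terms so records can be compared."""
    return {canonical(key): value for key, value in record.items()}


mine = {ZIP_A: "14850"}
yours = {POSTAL_B: "14850"}
print(normalize(mine) == normalize(yours))
```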

Our scheme for sending in the clowns 
to entertain my customers is partially 
solved when the two databases point to 
different definitions of address. The 
program, using distinct URIs for differ- 
ent concepts of address, will not con- 
fuse them and in fact will need to discov- 
er that the concepts are related at all. The 
program could then use a service that 
takes a list of postal addresses (defined 
in the first ontology) and converts it into 
a list of physical addresses (the second 
ontology) by recognizing and removing 
post office boxes and other unsuitable 
addresses. The structure and semantics 
provided by ontologies make it easier 
for an entrepreneur to provide such a 
service and can make its use completely 
transparent. 

Ontologies can enhance the func- 
tioning of the Web in many ways. They 
can be used in a simple fashion to im- 
prove the accuracy of Web searches—the 
search program can look for only those 
pages that refer to a precise concept in- 
stead of all the ones using ambiguous 
keywords. More advanced applications 
will use ontologies to relate the informa- 
tion on a page to the associated knowl- 
edge structures and inference rules. An 
example of a page marked up for such 
use is online at http://www.cs.umd.edu/~hendler.
If you send your Web browser
to that page, you will see the normal Web 
page entitled “Dr. James A. Hendler.” As 
a human, you can readily find the link to 
a short biographical note and read there
that Hendler received his Ph.D. from
Brown University. A computer program 
trying to find such information, howev- 
er, would have to be very complex to 
guess that this information might be in a 
biography and to understand the English 
language used there. 

For computers, the page is linked to 
an ontology page that defines informa- 
tion about computer science depart- 
ments. For instance, professors work at 
universities and they generally have doc- 
torates. Further markup on the page (not 
displayed by the typical Web browser) 
uses the ontology’s concepts to specify 
that Hendler received his Ph.D. from the 
entity described at the URI http://www.brown.edu/
(the Web page for Brown).
Computers can also find that Hendler is 
a member of a particular research pro- 
ject, has a particular e-mail address, and 
so on. All that information is readily 
processed by a computer and could be 
used to answer queries (such as where 
Dr. Hendler received his degree) that cur- 
rently would require a human to sift 
through the content of various pages 
turned up by a search engine. 

In addition, this markup makes it 
much easier to develop programs that 
can tackle complicated questions whose 
answers do not reside on a single Web 
page. Suppose you wish to find the Ms. 
Cook you met at a trade conference last 
year. You don’t remember her first name, 
but you remember that she worked for 
one of your clients and that her son was 
a student at your alma mater. An intelli- 
gent search program can sift through 
all the pages of people whose name is 
“Cook” (sidestepping all the pages relat- 
ing to cooks, cooking, the Cook Islands 
and so forth), find the ones that mention
working for a company that’s on your 
list of clients and follow links to Web 
pages of their children to track down if 
any are in school at the right place. 
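
The search for Ms. Cook amounts to a query that joins facts from several pages. Here is a hedged Python sketch over a handful of invented triples. The identifiers (`person:1`, `org:Acme`, `studiesAt` and so on) are placeholders, and a real agent would gather such facts by following links rather than from a fixed list.

```python
TRIPLES = [
    ("person:1", "surname", "Cook"),
    ("person:1", "worksFor", "org:Acme"),
    ("person:1", "hasChild", "person:2"),
    ("person:2", "studiesAt", "org:StateU"),
    ("person:3", "surname", "Cook"),
    ("person:3", "worksFor", "org:Other"),
    ("page:9", "topic", "cooking"),  # keyword match only, never a candidate
]


def values(subject, predicate):
    """All values asserted for one subject and property."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def find_ms_cook(clients, alma_mater):
    """People named Cook who work for a client and have a child at the school."""
    matches = []
    for s, p, o in TRIPLES:
        if p != "surname" or o != "Cook":
            continue
        if not clients & set(values(s, "worksFor")):
            continue
        children = values(s, "hasChild")
        if any(alma_mater in values(child, "studiesAt") for child in children):
            matches.append(s)
    return matches


print(find_ms_cook({"org:Acme", "org:Initech"}, "org:StateU"))
```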


Agents 

THE REAL POWER of the Semantic Web 
will be realized when people create many 
programs that collect Web content from 
diverse sources, process the information 
and exchange the results with other pro- 
grams. The effectiveness of such software 
agents will increase exponentially as more 
machine-readable Web content and auto- 
mated services (including other agents) be- 
come available. The Semantic Web pro- 
motes this synergy: even agents that were 
not expressly designed to work together 
can transfer data among themselves when 
the data come with semantics. 

An important facet of agents’ func- 
tioning will be the exchange of “proofs” 
written in the Semantic Web’s unifying 
language (the language that expresses log- 
ical inferences made using rules and infor- 
mation such as those specified by ontolo- 
gies). For example, suppose Ms. Cook’s 
contact information has been located by 
an online service, and to your great sur- 
prise it places her in Johannesburg. Nat- 
urally, you want to check this, so your 
computer asks the service for a proof of 
its answer, which it promptly provides by 
translating its internal reasoning into the 
Semantic Web’s unifying language. An in- 
ference engine in your computer readily 
verifies that this Ms. Cook indeed match- 
es the one you were seeking, and it can 
show you the relevant Web pages if you 
still have doubts. Although they are still 
far from plumbing the depths of the Se- 
mantic Web’s potential, some programs 
can already exchange proofs in this way, 
using the current preliminary versions of 
the unifying language. 
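
To make the idea of checking a proof concrete, the following sketch treats a proof as a list of steps, each either citing a trusted fact or applying one fixed rule (transitivity of a made-up `locatedIn` property). The step format and all identifiers are assumptions for illustration; the article does not specify what the unifying language looks like.

```python
TRUSTED_FACTS = {
    ("person:cook", "locatedIn", "office:jnb"),
    ("office:jnb", "locatedIn", "city:Johannesburg"),
}


def verify(proof, trusted_facts):
    """Check each step; accept only facts we trust or valid rule applications."""
    established = []
    for step in proof:
        triple = step["triple"]
        if step["kind"] == "fact":
            ok = triple in trusted_facts
        elif step["kind"] == "transitive":
            idx = step["premises"]
            ok = len(idx) == 2 and all(0 <= i < len(established) for i in idx)
            if ok:
                a, b = established[idx[0]], established[idx[1]]
                ok = (a[1] == b[1] == "locatedIn" and a[2] == b[0]
                      and triple == (a[0], "locatedIn", b[2]))
        else:
            ok = False
        if not ok:
            return False
        established.append(triple)
    return True


proof = [
    {"kind": "fact", "triple": ("person:cook", "locatedIn", "office:jnb")},
    {"kind": "fact", "triple": ("office:jnb", "locatedIn", "city:Johannesburg")},
    {"kind": "transitive", "premises": [0, 1],
     "triple": ("person:cook", "locatedIn", "city:Johannesburg")},
]
print(verify(proof, TRUSTED_FACTS))
```

Checking a proof is much cheaper than finding one: the verifier only confirms that each step follows, which is why the receiving computer can do it readily.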

Another vital feature will be digital 
signatures, which are encrypted blocks of 
data that computers and agents can use 
to verify that the attached information 
has been provided by a specific trusted 
source. You want to be quite sure that a 
statement sent to your accounting pro- 
gram that you owe money to an online 
retailer is not a forgery generated by the 
computer-savvy teenager next door.
Agents should be skeptical of assertions 
that they read on the Semantic Web un- 
til they have checked the sources of in- 
formation. (We wish more people would 
learn to do this on the Web as it is!) 

Many automated Web-based services 
already exist without semantics, but oth- 
er programs such as agents have no way 
to locate one that will perform a specific 
function. This process, called service dis- 
covery, can happen only when there is a 
common language to describe a service in 
a way that lets other agents “under- 
stand” both the function offered and how 
to take advantage of it. Services and agents 
can advertise their function by, for ex- 
ample, depositing such descriptions in di- 
rectories analogous to the Yellow Pages. 
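
A directory of advertised services might be sketched as follows. The `Directory` class, its method names and the URIs are hypothetical, and real service descriptions would carry far richer semantics than a function URI and a set of required inputs.

```python
class Directory:
    """A toy Yellow Pages in which services describe what they do and need."""

    def __init__(self):
        self._ads = []

    def advertise(self, service_uri, function_uri, required_inputs):
        self._ads.append((service_uri, function_uri, frozenset(required_inputs)))

    def discover(self, function_uri, available_inputs):
        """Services offering the function whose inputs the caller can supply."""
        have = set(available_inputs)
        return [service for service, function, needs in self._ads
                if function == function_uri and needs <= have]


# Hypothetical function and data-type URIs.
IN_PLAN_CHECK = "http://example.org/functions#inPlanCheck"
PROVIDER_LIST = "http://example.org/types#providerList"
PLAN_ID = "http://example.org/types#insurancePlanId"
TREATMENT = "http://example.org/types#treatment"

directory = Directory()
directory.advertise("http://example.org/svc/basic", IN_PLAN_CHECK,
                    {PROVIDER_LIST, PLAN_ID})
directory.advertise("http://example.org/svc/detailed", IN_PLAN_CHECK,
                    {PROVIDER_LIST, PLAN_ID, TREATMENT})

print(directory.discover(IN_PLAN_CHECK, {PROVIDER_LIST, PLAN_ID}))
```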

Some low-level service-discovery 
schemes are currently available, such as 
Microsoft’s Universal Plug and Play, 
which focuses on connecting different 
types of devices, and Sun Microsystems’s 
Jini, which aims to connect services. 
These initiatives, however, attack the 
problem at a structural or syntactic level 
and rely heavily on standardization of a 
predetermined set of functionality de- 
scriptions. Standardization can only go 
so far, because we can’t anticipate all 
possible future needs. 

The Semantic Web, in contrast, is 
more flexible. The consumer and pro- 
ducer agents can reach a shared under- 
standing by exchanging ontologies, 
which provide the vocabulary needed for 
discussion. Agents can even “bootstrap” 
new reasoning capabilities when they dis- 
cover new ontologies. Semantics also 
makes it easier to take advantage of a ser- 
vice that only partially matches a request. 

A typical process will involve the cre- 
ation of a “value chain” in which sub- 
assemblies of information are passed from 
one agent to another, each one “adding 
value,” to construct the final product re- 
quested by the end user. Make no mistake: 
to create complicated value chains auto- 
matically on demand, some agents will ex- 
ploit artificial-intelligence technologies in 
addition to the Semantic Web. But the Se- 
mantic Web will provide the foundations 
and the framework to make such technologies
more feasible.

SOFTWARE AGENTS will be greatly facilitated by semantic content on the Web. In the depicted scenario,
Lucy’s agent tracks down a physical therapy clinic for her mother that meets a combination of criteria and
has open appointment times that mesh with her and her brother Pete’s schedules. Ontologies that define
the meaning of semantic data play a key role in enabling the agent to understand what is on the Semantic
Web, interact with sites and employ other automated services.


Putting all these features together re- 
sults in the abilities exhibited by Pete’s 
and Lucy’s agents in the scenario that 
opened this article. Their agents would 
have delegated the task in piecemeal fash- 
ion to other services and agents discov- 
ered through service advertisements. For 
example, they could have used a trusted 
service to take a list of providers and de- 
termine which of them are in-plan for a 
specified insurance plan and course of 
treatment. The list of providers would 
have been supplied by another search ser- 
vice, et cetera. These activities formed 
chains in which a large amount of data 
distributed across the Web (and almost
worthless in that form) was progressive- 
ly reduced to the small amount of data of 
high value to Pete and Lucy—a plan of 
appointments to fit their schedules and 
other requirements. 

In the next step, the Semantic Web will 
break out of the virtual realm and extend 
into our physical world. URIs can point to 
anything, including physical entities, 
which means we can use the RDF lan- 
guage to describe devices such as cell 
phones and TVs. Such devices can advertise
their functionality (what they can do
and how they are controlled) much like
software agents. Being much more flexible
than low-level schemes such as Universal
Plug and Play, such a semantic approach
opens up a world of exciting possibilities.

AFTER WE GIVE a presentation about the Semantic Web, we’re often asked, “Okay, so what is the killer application of the Semantic
Web?” The “killer app” of any technology, of course, is the application that brings a user to investigate the technology and start
using it. The transistor radio was a killer app of transistors, and the cell phone is a killer app of wireless technology.
So what do we answer? “The Semantic Web is the killer app.”
At this point we’re likely to be told we’re crazy, so we ask a question in turn: “Well, what’s the killer app of the World Wide Web?”
Now we’re being stared at kind of fish-eyed, so we answer ourselves: “The Web is the killer app of the Internet. The Semantic Web is
another killer app of that magnitude.”

The point here is that the abilities of the Semantic Web are too general to be thought about in terms of solving one key problem
or creating one essential gizmo. It will have uses we haven’t dreamed of.

Nevertheless, we can foresee some disarming (if not actually killer) apps that will drive initial use. Online catalogs with
semantic markup will benefit both buyers and sellers. Electronic commerce transactions will be easier for small businesses to set
up securely with greater autonomy. And one final example: you make reservations for an extended trip abroad. The airlines, hotels,
soccer stadiums and so on return confirmations with semantic markup. All the schedules load directly into your date book and all
the expenses directly into your accounting program, no matter what semantics-enabled software you use. No more laborious
cutting and pasting from e-mail. No need for all the businesses to supply the data in half a dozen different formats or to create and
impose their own standard format.

For instance, what today is called 
home automation requires careful config- 
uration for appliances to work together. 
Semantic descriptions of device capabili- 
ties and functionality will let us achieve 
such automation with minimal human in- 
tervention. A trivial example occurs when 
Pete answers his phone and the stereo 
sound is turned down. Instead of having 
to program each specific appliance, he 
could program such a function once and 
for all to cover every local device that ad- 
vertises having a volume control—the 
TV, the DVD player and even the media 
players on the laptop that he brought 
home from work this one evening. 
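
Pete’s one-time rule could look something like the sketch below, in which devices advertise capabilities as URIs and the rule applies to whatever advertises a volume control. The `Device` class, the capability URI and the volume scale are assumptions made for the example.

```python
# Hypothetical capability URI that devices might advertise.
VOLUME_CONTROL = "http://example.org/capabilities#volumeControl"


class Device:
    def __init__(self, name, capabilities, volume=0):
        self.name = name
        self.capabilities = set(capabilities)
        self.volume = volume


def on_phone_answered(devices, quiet_level=1):
    """Turn down every device that advertises a volume control."""
    for device in devices:
        if VOLUME_CONTROL in device.capabilities:
            device.volume = min(device.volume, quiet_level)


home = [
    Device("stereo", {VOLUME_CONTROL}, volume=8),
    Device("tv", {VOLUME_CONTROL}, volume=5),
    Device("lamp", set()),
    # A laptop brought home for one evening is covered with no extra setup.
    Device("laptop media player", {VOLUME_CONTROL}, volume=6),
]

on_phone_answered(home)
print([(d.name, d.volume) for d in home])
```

The rule is written against the capability, not against a list of appliances, so a device that shows up later is handled without reprogramming.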

The first concrete steps have already 
been taken in this area, with work on de- 
veloping a standard for describing func- 
tional capabilities of devices (such as 
screen sizes) and user preferences. Built 
on RDF, this standard is called Compos- 
ite Capability/Preference Profile (CC/PP). 
Initially it will let cell phones and other 
nonstandard Web clients describe their 
characteristics so that Web content can 
be tailored for them on the fly. Later, 
when we add the full versatility of lan- 
guages for handling ontologies and log- 
ic, devices could automatically seek out 
and employ services and other devices for
added information or functionality. It is 
not hard to imagine your Web-enabled 
microwave oven consulting the frozen- 
food manufacturer’s Web site for opti- 
mal cooking parameters. 


Evolution of Knowledge 

THE SEMANTIC WEB is not “merely” the 
tool for conducting individual tasks that 
we have discussed so far. In addition, if 
properly designed, the Semantic Web can 
assist the evolution of human knowledge 
as a whole. 

Human endeavor is caught in an eter- 
nal tension between the effectiveness of 
small groups acting independently and 
the need to mesh with the wider commu- 
nity. A small group can innovate rapidly 
and efficiently, but this produces a sub- 
culture whose concepts are not under- 
stood by others. Coordinating actions 
across a large group, however, is painful- 
ly slow and takes an enormous amount 
of communication. The world works 
across the spectrum between these ex- 
tremes, with a tendency to start small— 
from the personal idea—and move to- 
ward a wider understanding over time. 

An essential process is the joining to- 
gether of subcultures when a wider com- 
mon language is needed. Often two groups 
independently develop very similar con- 
cepts, and describing the relation between 
them brings great benefits. Like a Finnish-English
dictionary, or a weights-and-measures
conversion table, the relations allow
communication and collaboration even 
when the commonality of concept has not 
(yet) led to a commonality of terms. 

The Semantic Web, in naming every 
concept simply by a URI, lets anyone ex- 
press new concepts that they invent with 
minimal effort. Its unifying logical lan- 
guage will enable these concepts to be 
progressively linked into a universal Web. 
This structure will open up the knowl- 
edge and workings of humankind to 
meaningful analysis by software agents, 
providing a new class of tools by which 
we can live, work and learn together. 


MORE TO EXPLORE 


Weaving the Web: The Original Design and 
Ultimate Destiny of the World Wide Web by Its 
Inventor. 

Tim Berners-Lee, with Mark Fischetti. Harper San 
Francisco, 1999. 

World Wide Web Consortium (W3C): www.w3.org/ 
W3C Semantic Web Activity: www.w3.org/2001/sw/ 
An introduction to ontologies: 
www.SemanticWeb.org/knowmarkup.html 


Simple HTML Ontology Extensions Frequently 
Asked Questions (SHOE FAQ): 
www.cs.umd.edu/projects/plus/SHOE/faq.html 


DARPA Agent Markup Language (DAML) home page: 
www.daml.org/


An operating system spanning the Internet would bring the power of
millions of the world’s Internet-connected PCs to everyone’s fingertips

By David P. Anderson and John Kubiatowicz


When Mary gets home from work and goes to her PC to check e-mail, the PC isn’t just
sitting there. It’s working for a biotech company, matching gene
sequences to a library of protein molecules. Its DSL connection
is busy downloading a block of radio telescope data to be analyzed
later. Its disk contains, in addition to Mary’s own files,
encrypted fragments of thousands of other files. Occasionally
one of these fragments is read and transmitted; it’s part of a
movie that someone is watching in Helsinki. Then Mary moves
the mouse, and this activity abruptly stops. Now the PC and its
network connection are all hers.

This sharing of resources doesn’t stop at her desktop computer.
The laptop computer in her satchel is turned off, but its
disk is filled with bits and pieces of other people’s files, as part
of a distributed backup system. Mary’s critical files are backed
up in the same way, saved on dozens of disks around the world.

Later, Mary watches an independent film on her Internet-connected
digital television, using a pay-per-view system. The
movie is assembled on the fly from fragments on several hundred
computers belonging to people like her.

Mary’s computers are moonlighting for other people. But
they’re not giving anything away for free. As her PC works,
pennies trickle into her virtual bank account. The payments 
come from the biotech company, the movie system and the 
backup service. Instead of buying expensive “server farms,” 
these companies are renting time and space, not just on Mary’s 
two computers but on millions of others as well. It’s a win-win 
situation. The companies save money on hardware, which en- 
ables, for instance, the movie-viewing service to offer obscure 
movies. Mary earns a little cash, her files are backed up, and 
she gets to watch an indie film. All this could happen with an 
Internet-scale operating system (ISOS) to provide the neces- 
sary “glue” to link the processing and storage capabilities of 
millions of independent computers. 


Internet-Scale Applications 

ALTHOUGH MARY’S WORLD is fictional—and an Internet- 
scale operating system does not yet exist—developers have al- 
ready produced a number of Internet-scale, or peer-to-peer, 
applications that attempt to tap the vast array of underutilized 
machines available through the Internet [see box on page 42]. 
These applications accomplish goals that
would be difficult, unaffordable or impossible
to attain using dedicated computers.
Further, today’s systems are just
the beginning: we can easily conceive of
archival services that could be relied on
for hundreds of years and intelligent
search engines for tomorrow’s Semantic
Web [see “The Semantic Web,” by Tim
Berners-Lee, James Hendler and Ora Lassila;
SCIENTIFIC AMERICAN, May 2001].

Unfortunately, the creation of Internet-scale
applications remains an imposing
challenge. Developers must build
each new application from the ground
up, with much effort spent on technical
matters, such as maintaining a database
of users, that have little to do with the
application itself. If Internet-scale applications
are to become mainstream, these
infrastructure issues must be dealt with
once and for all.

We can gain inspiration for eliminating
this duplicate effort from operating
systems such as Unix and Microsoft
Windows. An operating system provides
a virtual computing environment in which
programs operate as if they were in sole
possession of the computer. It shields
programmers from the painful details of
memory and disk allocation, communication
protocols, scheduling of myriad
processes, and interfaces to devices for
data input and output. An operating system
greatly simplifies the development of
new computer programs. Similarly, an
Internet-scale operating system would
simplify the development of new distributed
applications.

More than 150 MILLION hosts are connected to the
Internet, and the number is GROWING exponentially.


Existing Distributed Systems 


COMPUTING 


GIMPS (Great Internet Mersenne Prime Search): 
www.mersenne.org/ 

Searches for large prime numbers. About 130,000 people are 
signed up, and five new primes have been found, including the 
largest prime known, which has four million digits. 


distributed.net: www.distributed.net/ 

Has decrypted several messages by using brute-force searches 
through the space of possible encryption keys. More than 100 
billion keys are tried each second on its current decryption 
project. Also searches for sets of numbers called optimal Golomb 
rulers, which have applications in coding and communications. 


SETI@home (Search for Extraterrestrial Intelligence): 
http://setiathome.berkeley.edu/ 

Analyzes radio telescope data, searching for signals of 
extraterrestrial origin. A total of 3.4 million users have devoted 
more than 800,000 years of processor time to the task. 


folding@home: http://folding.stanford.edu/ 

Run by Vijay Pande’s group in the chemistry department at 
Stanford University, this project has about 20,000 computers 
performing molecular-dynamics simulations of how proteins 
fold, including the folding of Alzheimer amyloid-beta protein. 


Intel/United Devices cancer research project: 
http://members.ud.com/projects/cancer/ 

Searches for possible cancer drugs by testing which of 3.5 billion 
molecules are best shaped to bind to any one of eight proteins 
that cancers need to grow. 


STORAGE


Napster: www.napster.com/ 

Allowed users to share digital music. A central database stored the 
locations of all files, but data were transferred directly between 
user systems. Songwriters and music publishers brought a class- 
action lawsuit against Napster. The parties reached an agreement 
whereby rights to the music would be licensed to Napster and 
artists would be paid, but the new fee-based service had not 
started as of January 2002. 


Gnutella: www.gnutella.com/ 

Provides a private, secure shared file system. There is no central 
server; instead a request for a file is passed from each computer 
to all its neighbors. 


Freenet: http://freenetproject.org/ 

Offers a similar service to Gnutella but uses a better file-location
protocol. Designed to keep file requesters and suppliers anonymous
and to make it difficult for a host owner to determine or be
held responsible for the Freenet files stored on his computer.


Mojo Nation: www.mojonation.net/ 

Also similar to Gnutella, but files are broken into small pieces
that are stored on different computers to improve the rate at
which data can be uploaded to the network. A virtual payment
system encourages users to provide resources.


Fasttrack P2P Stack: www.fasttrack.nu/ 

A peer-to-peer system in which more powerful computers become 
search hubs as needed. This software underlies the Grokster, 
MusicCity (“Morpheus”) and KaZaA file-sharing services. 
An ISOS consists of a thin layer of
software (an ISOS agent) that runs on 
each “host” computer (such as Mary’s) 
and a central coordinating system that 
runs on one or more ISOS server com- 
plexes. This veneer of software would 
provide only the core functions of allo- 
cating and scheduling resources for each 
task, handling communication among 
host computers and determining the re- 
imbursement required for each machine. 
This type of operating system, called a 
microkernel, relegates higher-level func- 
tions to programs that make use of the 
operating system but are not a part of it. 
For instance, Mary would not use the 
ISOS directly to save her files as pieces 
distributed across the Internet. She might 
run a backup application that used ISOS 
functions to do that for her. The ISOS 
would use principles borrowed from eco- 
nomics to apportion computing re- 
sources to different users efficiently and 
fairly and to compensate the owners of 
the resources. 

Two broad types of applications might 
benefit from an ISOS. The first is distrib- 
uted data processing, such as physical 
simulations, radio signal analysis, genet- 
ic analysis, computer graphics rendering 
and financial modeling. The second is dis- 
tributed online services, such as file stor- 
age systems, databases, hosting of Web 
sites, streaming media (such as online 
video) and advanced Web search engines. 


What’s Mine Is Yours 
COMPUTING TODAY operates pre- 
dominantly as a private resource; orga- 
nizations and individuals own the sys- 
tems that they use. An ISOS would facil- 
itate a new paradigm in which it would 
be routine to make use of resources all 
across the Internet. The resource pool— 
hosts able to compute or store data and 
networks able to transfer data between 
hosts—would still be individually owned, 
but they could work for anyone. Hosts 
would include desktops, laptops, server 
computers, network-attached storage de- 
vices and maybe handheld devices. 

The Internet resource pool differs from private resource pools in several
important ways. More than 150 million hosts are connected to the Internet, and
the number is growing exponentially. Consequently, an ISOS could provide a virtual
computer with potentially 150 million times the processing speed and storage
capacity of a typical single computer. Even when this virtual computer is divided
up among many users, and after one allows for the overhead of running the network,
the result is a bigger, faster and cheaper computer than the users could own
privately. Continual upgrading of the resource pool’s hardware causes the total
speed and capacity of this uber-computer to increase even faster than the number
of connected hosts. Also, the pool is self-maintaining: when a computer breaks
down, its owner eventually fixes or replaces it.


MOONLIGHTING COMPUTERS

With Internet-scale applications, PCs around the world can work during times when
they would otherwise sit idle. Here’s how it works:

1. An Internet-scale operating system (ISOS) coordinates all the participating
computers and pays them for their work.

2. Mary’s home computer works while she’s away. It’s one of millions of PCs
that are crunching data and delivering file fragments for the network.

3. Her laptop stores backup copies of encrypted fragments of other users’
files. The laptop is connected only occasionally, but that suffices.

4. When Mary gets back on her PC, the work for the network is automatically
suspended.

5. Later, Mary watches an obscure indie movie that is consolidated from file
fragments delivered by the network.


HOW A DISTRIBUTED SERVICE WOULD OPERATE

1. Acme Movie Service wants to distribute movies to viewers. Acme requests hosts
from the ISOS server, the system’s traffic cop. The ISOS server sends a host list,
for which Acme pays.

3. Acme sends its movie service agent program to the hosts. Acme splits its
movie into fragments and also sends them to hosts.

5. Movie service instructs its agents to send Mary the movie. ISOS pays
the hosts for their work.

7. The movie is assembled, and Mary is free to enjoy her Acme movie.

Extraordinary parallel data transmis- 
sion is possible with the Internet resource 
pool. Consider Mary’s movie, being up- 
loaded in fragments from perhaps 200 
hosts. Each host may be a PC connected 
to the Internet by an antiquated 56k mo- 
dem—far too slow to show a high-quali- 
ty video—but combined they could deliv- 
er 10 megabits a second, better than a ca- 
ble modem. Data stored in a distributed 
system are available from any location 
(with appropriate security safeguards) 
and can survive disasters that knock out 
sections of the resource pool. Great secu- 
rity is also possible, with systems that 
could not be compromised without break- 
ing into, say, 10,000 computers. 
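
The bandwidth figure above can be checked with a line of arithmetic. The sketch below assumes, purely for illustration, that each of the 200 hosts can send at the full nominal 56 kilobits a second; the function and constant names are hypothetical.

```python
# Back-of-the-envelope check of the aggregate rate quoted above.
MODEM_BITS_PER_SECOND = 56_000   # nominal rate of a 56k modem (assumed fully usable)
HOSTS = 200                      # hosts sending movie fragments in parallel

def aggregate_megabits_per_second(hosts: int, bits_per_second: int) -> float:
    """Combined rate in megabits per second if every host sends at full speed."""
    return hosts * bits_per_second / 1_000_000

print(aggregate_megabits_per_second(HOSTS, MODEM_BITS_PER_SECOND))  # 11.2
```

The nominal ceiling of 11.2 megabits a second leaves some room for protocol overhead before the combined rate drops to the 10 megabits a second mentioned in the text.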

In this way, the Internet-resource par- 
adigm can increase the bounds of what is 
possible (such as higher speeds or larger 
data sets) for some applications, where- 
as for others it can lower the cost. For 
certain applications it may do neither— 
it’s a paradigm, not a panacea. And de- 
signing an ISOS also presents a number 
of obstacles. 

Some characteristics of the resource
pool create difficulties that an ISOS must 
deal with. The resource pool is heteroge- 
neous: Hosts have different processor 
types and operating systems. They have 
varying amounts of memory and disk 
space and a wide range of Internet con- 
nection speeds. Some hosts are behind 
firewalls or other similar layers of soft- 
ware that prohibit or hinder incoming 
connections. Many hosts in the pool are 
available only sporadically; desktop PCs 
are turned off at night, and laptops and 
systems using modems are frequently not 
connected. Hosts disappear unpredict- 
ably—sometimes permanently—and new 
hosts appear. 

The ISOS must also take care not to 
antagonize the owners of hosts. It must 
have a minimal impact on the non-ISOS 
uses of the hosts, and it must respect lim- 
itations that owners may impose, such as 
allowing a host to be used only at night 
or only for specific types of applications. 
Yet the ISOS cannot trust every host to 
play by the rules in return for its own 
good behavior. Owners can inspect and 
modify the activities of their hosts. Curious
and malicious users may attempt to
disrupt, cheat or spoof the system. All 
these problems have a major influence on 
the design of an ISOS. 


Who Gets What? 
AN INTERNET-SCALE operating sys- 
tem must address two fundamental is- 
sues—how to allocate resources and how 
to compensate resource suppliers. A 
model based on economic principles in 
which suppliers lease resources to con- 
sumers can deal with both issues at once. 
In the 1980s researchers at Xerox PARC 
proposed and analyzed economic ap- 
proaches to apportioning computer re- 
sources. More recently, Mojo Nation de- 
veloped a file-sharing system in which 
users are paid in a virtual currency 
(“mojo”) for use of their resources and 
they in turn must pay mojo to use the sys- 
tem. Such economic models encourage 
owners to allow their resources to be 
used by other organizations, and theory 
shows that they lead to optimal alloca- 
tion of resources. 

Even with 150 million hosts at its dis- 
posal, the ISOS will be dealing in “scarce” 
resources, because some tasks will request
and be capable of using essentially un- 
limited resources. As it constantly decides 
where to run data-processing jobs and 
how to allocate storage space, the ISOS 
must try to perform tasks as cheaply as 
possible. It must also be fair, not allowing 
one task to run efficiently at the expense 
of another. Making these criteria pre- 
cise—and devising scheduling algorithms 
to achieve them, even approximately— 
are areas of active research. 

The economic system for a shared network must define the basic units of a
resource, such as the use of a megabyte of disk space for a day, and assign values
that take into account properties such as the rate, or bandwidth, at which the
storage can be accessed and how frequently it is available to the network. The
system must also define how resources are bought and sold (whether they are paid
for in advance, for instance) and how prices are determined (by auction or by a
price-setting middleman).

Within this framework, the ISOS must accurately and securely keep track of
resource usage. The ISOS would have an internal bank with accounts for suppliers
and consumers that it must credit or debit according to resource usage.
Participants can convert between ISOS currency and real money. The ISOS must also
ensure that any guarantees of resource availability can be met: Mary doesn’t
want her movie to grind to a halt partway through. The economic system lets
resource suppliers control how their resources are used. For example, a PC
owner might specify that her computer’s processor can’t be used between 9 A.M.
and 5 P.M. unless a very high price is paid.

Money, of course, encourages fraud, and ISOS participants have many ways
to try to defraud one another. For instance, resource sellers, by modifying or
fooling the ISOS agent program running on their computer, may return fictitious
results without doing any computation. Researchers have explored statistical
methods for detecting malicious or malfunctioning hosts. A recent idea for
preventing unearned computation credit is to ensure that each work unit has a
number of intermediate results that the server can quickly check and that can be
obtained only by performing the entire computation. Other approaches are needed
to prevent fraud in data storage and service provision.

The cost of ISOS resources to end users will converge to a fraction of the
cost of owning the hardware. Ideally, this fraction will be large enough to
encourage owners to participate and small enough to make many Internet-scale
applications economically feasible. A typical PC owner might see the system as a
barter economy in which he gets free services, such as file backup and Web
hosting, in exchange for the use of his otherwise idle processor time and disk
space.


A Basic Architecture

WE ADVOCATE two basic principles in our ISOS design: a minimal core operating
system and control by central servers.


Primes and Crimes


By Graham P. Collins 


NO ONE HAS SEEN signs of extraterrestrials using a distributed computation project
(yet), but people have found the largest-known prime numbers, five-figure reward
money, and legal trouble.

The Great Internet Mersenne Prime Search (GIMPS), operating since 1996, has
turned up five extremely large prime numbers so far. The fifth and largest was
discovered in November 2001 by 20-year-old Michael Cameron of Owen Sound, Ontario.
Mersenne primes can be expressed as 2^P − 1, where P is itself a prime number.
Cameron’s is 2^13,466,917 − 1, which would take four million digits to write out. His
computer spent 45 days discovering that his number is a prime; altogether the GIMPS
network expended 13,000 years of computer time eliminating other numbers that could
have been the 39th Mersenne.


The 38th Mersenne prime, a mere two million digits long, earned its discoverer 
(Nayan Hajratwala of Plymouth, Mich.) a $50,000 reward for being the first prime with 
more than a million digits. A prime with 10 million digits will win someone $100,000. 
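
For readers who want to see what testing a Mersenne number involves, here is a minimal sketch of the Lucas-Lehmer test, the standard primality test for numbers of the form 2^P − 1. The sidebar does not describe the software GIMPS actually runs, so treat this as an illustration only; production code relies on much faster multiplication than Python’s built-in integers. The function names are hypothetical.

```python
import math

def is_mersenne_prime(p: int) -> bool:
    """Lucas-Lehmer test: for a prime exponent p, decide whether 2**p - 1 is prime."""
    if p == 2:
        return True              # 2**2 - 1 = 3 is prime; the loop below needs odd p
    m = (1 << p) - 1             # the Mersenne number itself
    s = 4
    for _ in range(p - 2):       # p - 2 squaring steps modulo m
        s = (s * s - 2) % m
    return s == 0

def mersenne_digits(p: int) -> int:
    """Number of decimal digits of 2**p - 1."""
    return math.floor(p * math.log10(2)) + 1

print([p for p in (2, 3, 5, 7, 11, 13, 17, 19) if is_mersenne_prime(p)])
# [2, 3, 5, 7, 13, 17, 19]
print(mersenne_digits(13_466_917))  # 4053946
```

The exponent 11 drops out because 2^11 − 1 = 2,047 = 23 × 89, and the digit count for Cameron’s exponent matches the “four million digits” quoted above.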

A Georgia computer technician, on the other hand, has found nothing but trouble 
through distributed computation. In 1999 David McOwen installed the client program for 
the “distributed.net” decryption project on computers in seven offices of the DeKalb 
Technical Institute, along with Y2K upgrades. During the Christmas holidays, the 
computers’ activity was noticed, including small data uploads and downloads each day. 
In January 2000 McOwen was suspended, and he resigned soon thereafter. 

Case closed? Case barely opened: The Georgia Bureau of Investigation spent 18 
months investigating McOwen as a computer criminal, and in October 2001 he was 
charged with eight felonies under Georgia’s computer crime law. The one count of 
computer theft and seven counts of computer trespass each carry a $50,000 fine and 
up to 15 years in prison. On January 17, a deal was announced whereby McOwen will 
serve one year of probation, pay $2,100 in restitution and perform 80 hours of 
community service unrelated to computers or technology. 


Graham P. Collins is a staff writer and editor. 


COPYRIGHT 2002 SCIENTIFIC AMERICAN, INC. 


SCIENTIFIC AMERICAN SPECIAL ONLINE ISSUE 36 


WHAT AN INTERNET-SCALE OPERATING SYSTEM COULD DO

By harnessing the massive unused computing resources of the global network, an ISOS would make short
work of daunting number-crunching tasks and data storage. Here are just a few of the possibilities:

3-D rendering and animation
File backup and archiving for hundreds of years
... of celestial radio signals
Matching gene sequences
Streaming media pay-per-view service


A computer operating system that pro- 
vides only core functions is called a micro- 
kernel. Higher-level functions are built on 
top of it as user programs, allowing them 
to be debugged and replaced more easi- 
ly. This approach was pioneered in acad- 
emic research systems and has influenced 
some commercial systems, such as Win- 
dows NT. Most well-known operating 
systems, however, are not microkernels. 

The core facilities of an ISOS include 
resource allocation (long-term assign- 
ment of hosts’ processing power and 
storage), scheduling (putting jobs into 
queues, both across the system and with- 
in individual hosts), accounting of re- 
source usage, and the basic mechanisms 
for distributing and executing applica- 
tion programs. The ISOS should not du- 
plicate features of local operating systems 
running on hosts. 

The system should be coordinated by 
servers operated by the ISOS provider, 
which could be a government-funded or- 
ganization or a consortium of companies 
that are major resource sellers and buy- 
ers. (One can imagine competing ISOS 
providers, but we will keep things simple 
and assume a unique provider.) Centralization
runs against the egalitarian approach popular in
some peer-to-peer systems, but central servers are needed to
ensure privacy of sensitive data, such as 
accounting data and other information 
about the resource hosts. Centralization 
might seem to require a control system 
that will become excessively large and 
unwieldy as the number of ISOS-con- 
nected hosts increases, and it appears to 
introduce a bottleneck that will choke the 
system anytime it is unavailable. These 
fears are unfounded: a reasonable num- 
ber of servers can easily store informa- 
tion about every Internet-connected host 
and communicate with them regularly. 
Napster, for example, handled almost 60 
million clients using a central server. Re- 
dundancy can be built into the server 
complex, and most ISOS online services 
can continue operating even with the 
servers temporarily unavailable. 

The ISOS server complex would 
maintain databases of resource descrip- 
tions, usage policies and task descrip- 
tions. The resource descriptions include, 
for example, the host’s operating system, 
processor type and speed, total and free 
disk space, memory space, performance 
statistics of its network connections, and 
statistical descriptions of when it is powered
on and connected to the network.
Usage policies spell out the rules an own- 
er has dictated for using her resources. 
Task descriptions include the resources 
assigned to an online service and the 
queued jobs of a data-processing task. 
To make their computers available to 
the network, resource sellers contact the 
server complex (for instance, through a 
Web site) to download and install an 
ISOS agent program, to link resources to 
their ISOS account, and so on. The ISOS 
agent manages the host’s resource usage. 
Periodically it obtains from the ISOS 
server complex a list of tasks to perform. 
Resource buyers send the servers task 
requests and application agent programs 
(to be run on hosts). An online service 
provider can ask the ISOS for a set of 
hosts on which to run, specifying its re- 
source requirements (for example, a dis- 
tributed backup service could use spo- 
radically connected resource hosts— 
Mary’s laptop—which would cost less 
than constantly connected hosts). The 
ISOS supplies the service with addresses 
and descriptions of the granted hosts and 
allows the application agent program to 
communicate directly between hosts on 
which it is running. The service can request
new hosts when some become unavailable.
The ISOS does not dictate how
clients make use of an online service, how 
the service responds or how clients are 
charged by the service (unlike the ISOS- 
controlled payments flowing from re- 
source users to host owners). 
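
The request-and-grant exchange described above can be pictured with a small sketch. Everything here is hypothetical: the article does not define a record layout or a selection rule, so the `Host` fields, the `grant_hosts` function and the “cheapest eligible hosts first” policy are assumptions chosen only to show how resource descriptions, usage policies and a service’s requirements might fit together.

```python
from dataclasses import dataclass

@dataclass
class Host:
    """Hypothetical resource description kept by the ISOS server complex."""
    address: str
    free_disk_mb: int
    uptime_fraction: float     # share of time the host is powered on and connected
    price_per_mb_day: float    # asking price in ISOS currency (made-up unit)
    allowed_hours: range       # owner's usage policy: hours of the day the host may be used

def grant_hosts(hosts, count, min_disk_mb, min_uptime, hour):
    """Return the `count` cheapest hosts that meet the request and the owners' policies."""
    eligible = [
        h for h in hosts
        if h.free_disk_mb >= min_disk_mb
        and h.uptime_fraction >= min_uptime
        and hour in h.allowed_hours
    ]
    return sorted(eligible, key=lambda h: h.price_per_mb_day)[:count]

pool = [
    Host("laptop.example", 500, 0.20, 0.01, range(0, 24)),
    Host("desktop.example", 2000, 0.60, 0.03, range(18, 24)),   # evenings only
    Host("server.example", 9000, 0.99, 0.10, range(0, 24)),
]

# A backup service tolerates sporadically connected hosts, so it gets the cheap laptop.
print([h.address for h in grant_hosts(pool, 1, 100, 0.10, hour=12)])  # ['laptop.example']
# A streaming service needs hosts that are almost always connected.
print([h.address for h in grant_hosts(pool, 1, 100, 0.95, hour=12)])  # ['server.example']
```

The desktop is skipped at noon because its owner allows use only in the evening, which is the kind of limitation a usage policy is meant to capture.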


An Application Toolkit 

IN PRINCIPLE, the basic facilities of the ISOS (resource allocation, scheduling
and communication) are sufficient to construct a wide variety of applications.
Most applications, however, will have important subcomponents in common. It
is useful, therefore, to have a software toolkit to further assist programmers in
building new applications. Code for these facilities will be incorporated into
applications on resource hosts. Examples of these facilities include:

Location independent routing. Applications running with the ISOS can spread
copies of information and instances of computation among millions of resource
hosts. They have to be able to access them again. To facilitate this,
applications name objects under their purview with Globally Unique Identifiers
(GUIDs). These names enable “location independent routing,” which is the ability
to send queries to objects without knowing their location. A simplistic approach
to location independent routing could involve a database of GUIDs on a single
machine, but that system is not amenable to handling queries from millions of
hosts. Instead the ISOS toolkit distributes the database of GUIDs among resource
hosts. This kind of distributed system is being explored in research projects such
as the OceanStore persistent data storage project at the University of California
at Berkeley.

Persistent data storage. Information stored by the ISOS must be able to survive
a variety of mishaps. The persistent data facility aids in this task with
mechanisms for encoding, reconstructing and repairing data. For maximum
survivability, data are encoded with an “m-of-n” code. An m-of-n code is similar
in principle to a hologram, from which a small piece suffices for reconstructing
the whole image. The encoding spreads information over n fragments (on n resource
hosts), any m of which are sufficient to reconstruct the data. For instance,
the facility might encode a document into 64 fragments, any 16 of which suffice to
reconstruct it. Continuous repair is also important. As fragments fail, the repair
facility would regenerate them. If properly constructed, a persistent data
facility could preserve information for hundreds of years.

Secure update. New problems arise when applications need to update stored
information. For example, all copies of the information must be updated, and the
object’s GUID must point to its latest copy. An access control mechanism must
prevent unauthorized persons from updating information. The secure update
facility relies on Byzantine agreement protocols, in which a set of resource hosts
come to a correct decision, even if a third of them are trying to lead the process
astray.

Other facilities. The toolkit also assists by providing additional facilities,
such as format conversion (to handle the heterogeneous nature of hosts) and
synchronization libraries (to aid in cooperation among hosts).

An ISOS suffers from a familiar catch-22 that slows the adoption of
many new technologies: Until a wide user base exists, only a limited set of
applications will be feasible on the ISOS. Conversely, as long as the applications
are few, the user base will remain small. But if a critical mass can be achieved
by convincing enough developers and users of the intrinsic usefulness of an ISOS,
the system should grow rapidly.

The Internet remains an immense un- 
tapped resource. The revolutionary rise in 
popularity of the World Wide Web has 
not changed that—it has made the re- 
source pool all the larger. An Internet- 
scale operating system would free pro- 
grammers to create applications that 
could run on this World Wide Computer 
without worrying about the underlying 
hardware. Who knows what will result? 
Mary and her computers will be doing 
things we haven’t even imagined. 


Many research projects are working toward an Internet-scale operating system, including:

Chord: www.pdos.lcs.mit.edu/chord/
Cosm: www.mithral.com/projects/cosm/
Eurogrid: www.eurogrid.org/
Farsite: http://research.microsoft.com/sn/farsite/
Grid Physics Network (Griphyn): www.griphyn.org/
OceanStore: http://oceanstore.cs.berkeley.edu/
Particle Physics Data Grid: www.ppdg.net/
Pastry: www.research.microsoft.com/~antr/pastry/
Tapestry: www.cs.berkeley.edu/~ravenben/tapestry/